ERCIM NEWS
www.ercim.eu

Special theme: Big Data
Also in this issue:
Keynote: E-Infrastructures for Big Data, by Kostas Glinos
The views expressed are those of the author and do not necessarily represent the official view of the European Commission on the subject.
Big Software Data Analysis
by Mircea Lungu, Oscar Nierstrasz and Niko Schwarz

Scalable Management of Compressed Semantic Big Data
by Javier D. Fernández, Miguel A. Martínez-Prieto and Mario Arias

SCAPE: Big Data Meets Digital Preservation
by Ross King, Rainer Schmidt, Christoph Becker and Sven Schlarb

Bionic Packaging: A Promising Paradigm for Future Computing
by Patrick Ruch, Thomas Brunschwiler, Werner Escher, Stephan Paredes and Bruno Michel

NanoICT: A New Challenge for ICT
by Mario D'Acunto, Antonio Benassi, Ovidio Salvetti

Information Extraction from Presentation-Oriented Documents
by Massimo Ruffolo and Ermelinda Oro

Announcements
Books
In Brief
Industrial Systems Institute/RC ‘Athena’ Joins ERCIM

by Dimitrios Serpanos
ISI is the main research institute in Greece that focuses on cutting-edge R&D that applies to the industrial and enterprise environment. Its main goal is sustainable innovation in information, communication and knowledge technologies for improving the competitiveness of Greek industry while applying environmentally friendly practices.

ISI is involved in a range of research areas such as industrial information and communication systems; enterprise systems and enterprise integration; embedded systems in several application areas, including transport, healthcare, and nomadic environments; enterprise and industrial process modelling; safety and security; reliability and cyber-security.

Our vision for ISI is to sustain a leading role in R&D of innovative industrial and enterprise technologies. ISI has made significant contributions at the regional, national, and European levels and has established strong relationships and collaboration with R&D and industrial stakeholders in Greece, Europe, and the USA.

Research and development in ICT and applied mathematics is a strategic element for ISI's sustainable growth. We expect that ERCIM will enable ISI to increase its contribution through stronger networking and cooperation with important institutions, members of ERCIM, and dissemination of its results and achievements through joint activities. Three members of its staff will represent ISI in ERCIM: Professor Dimitrios Serpanos (Director of ISI) as the representative in the ERCIM general member assembly, with Dr Nikolaos Zervos (Researcher) as his substitute, and Dr Artemios Voyiatzis (Researcher) as ISI's representative on the ERCIM News editorial board.

Link:
http://www.isi.gr/en/

Please contact:
Dimitrios Serpanos
Industrial Systems Institute/RC ‘Athena’, Greece
E-mail: [email protected], Tel: +30 261 091 0301
Research at ISI
ISI performs high-impact research in the following areas:
• information and communication systems for the industry and enterprise environment
  - high-performance communication systems and architectures
  - real-time communications
  - low-power hardware architectures for processing and communication
• embedded systems
  - design and architecture
  - interoperability
  - design tools
  - real-time cooperation and coordination
  - safety and reliability
• cyber-security
• electronic systems
  - RFID
• enterprise integration
  - next-generation control systems
  - software agents
  - ontologies for enterprise processes, resources, and products
  - service-oriented architecture
  - collaboration platforms, virtual and extended enterprise, enterprise clustering
  - collaborative manufacturing
• modelling and automation of industrial systems and processes
  - control of industrial robots
  - control of mobile robots and autonomous vehicles
  - control of distributed industrial systems
  - control of complex electromechanical systems
  - system modeling and fault prediction, detection, and isolation
  - intelligent and adaptive systems
• sustainable development
  - ICT for energy-efficient systems
  - ICT for increasing sustainable energy awareness
  - sustainable (multimodal) transport

The R&D activities of ISI draw upon many areas of mathematics, computer science and engineering including:
• hardware (memory structures, I/O and data communications, logic design, integrated circuit, performance and reliability)
• computer systems organization (processor architecture, computer communication networks, performance of systems, computer system implementation)
• software (software engineering, operating systems)
• mathematics of computing (discrete mathematics, probability and statistics)
• computing methodologies (artificial intelligence, image processing and computer vision, pattern recognition, simulation and modelling)
• computer applications (physical sciences and engineering, life and medical science, computer-aided engineering)
• computers and society (public policy issues, electronic commerce, security)

ISI has produced successful products and services such as:
• an innovative wireless network system for real-time industrial communications
• a system for sea protection from ship wreckage oil spill
• a system for earthquake monitoring and disaster rescue assistance
• an integrated building management system
• a smart camera system for surveillance.
International Workshop on Computational Intelligence for Multimedia Understanding

by Emanuele Salerno
Figure: Ovidio Salvetti and Emanuele Salerno open the workshop.
The National Research Council of Italy in Pisa hosted the International Workshop on Computational Intelligence for Multimedia Understanding, organized by the ERCIM Working Group on Multimedia Understanding through Semantics, Computation and Learning (Muscle), 11-13 December 2011. The workshop was co-sponsored by CNR and Inria. Proceedings to come in LNCS 7252.

Computational intelligence is becoming increasingly important in our households, and in all commercial, industrial and scientific environments. A host of information sources, different in nature, format, reliability and content are normally available at virtually no expense. Analyzing raw data to provide them with semantics is essential to exploit their full potential and help us manage our everyday tasks. This is also important for data description for storage and mining. Interoperability and exchangeability of heterogeneous and distributed data is an essential requirement for any practical application. Semantics is information at the highest level, and inferring it from raw data (that is, from information at the lowest level) entails exploiting both data and prior information to extract structure and meaning. Computation, machine learning, ontologies, statistical and Bayesian methods are tools to achieve this goal. They were all discussed in this workshop.

30 participants from eleven countries attended the workshop. With the help of more than 20 external reviewers, the program committee ensured a high scientific level, accepting 19 papers for presentation. Both theoretical and application-oriented talks, covering a very wide range of subjects, were given. Papers on applications included both the development of basic concepts towards the real-world practice and the actual realization of working systems. Most contributions focussed on still images, but there were some on video and text. Two presentations exploited results from studies on human visual perception for image and video analysis and synthesis. Two greatly appreciated invited lectures were given by Sanni Siltanen, VTT, former Muscle chair, and Bülent Sankur, Bogaziçi University, Istanbul. These dealt with emotion detection from facial images (Sankur) and advanced applications of augmented reality (Siltanen).

The Muscle group has more than 50 members from research groups in 15 countries. Their expertise ranges from machine learning and artificial intelligence to statistics, signal processing and multimedia database management. The goal of Muscle is to foster international cooperation in multimedia research and carry out training initiatives for young researchers.

Links:
http://muscle.isti.cnr.it/pisaworkshop2011/
http://wiki.ercim.eu/wg/MUSCLE/

Please contact:
Emanuele Salerno, MUSCLE WG chair, ISTI-CNR, Italy
Tel: +39 050 315 3137; E-mail: [email protected]
The European Forum for ICST is Taking Shape

Since 2006 there has been discussion about creating a platform for cooperation among the European societies in Information and Communication Sciences and Technologies (ICST). The objective is to have a stronger, unified voice for ICST professionals in Europe. The forum will help to develop common viewpoints and strategies for ICST in Europe and, whenever appropriate or needed, a common representation of these viewpoints and strategies at the international level.

In 2006 and 2008, Keith Jeffery from STFC, UK, led expert groups on the topic at the request of the European Commission. The Commission initiated a project to survey the views of ICST professionals in Europe; significantly, this survey indicated that ERCIM was the best known and regarded organisation in Europe. Working closely with Informatics Europe and the European Chapter of the ACM, ERCIM has pushed steadily for progress. A meeting at the 2011 Informatics Europe annual conference confirmed a widespread desire to form such an association. An initial constellation of Informatics Europe, ERCIM, the European Chapter of ACM, CEPIS (Council of European Professional Informatics Societies), EATCS (European Association for Theoretical Computer Science), EAPLS (European Association for Programming Languages and Systems), and EASST (European Association of Software Science and Technology) was agreed, with Jan van Leeuwen, Utrecht University, representing Informatics Europe, taking the chair, and ERCIM providing the coordinating website. On 2 February 2012, a meeting was hosted at Inria, Paris, where the objectives and structure of the European Forum for ICST (EFICST) were defined and agreed, with the vice presidents Keith Jeffery, STFC, representing ERCIM, and Paul Spirakis representing EATCS and ACM Europe.

Please contact:
Keith Jeffery, STFC, UK
ERCIM president and EFICST vice-president
E-mail: [email protected]
Big Data
by Costantino Thanos, Stefan Manegold and Martin Kersten
'Big data' refers to data sets whose size is beyond the capabilities of current database technology.
The current data deluge is revolutionizing the way research is carried out and is resulting in the emergence of a new, fourth paradigm of science based on data-intensive computing. This data-dominated science implies a new, data-centric way of conceptualizing, organizing and carrying out research activities. It could introduce new approaches to problems that were previously considered extremely hard or, in some cases, impossible to solve, and could also lead to serendipitous discoveries. The recent availability of huge amounts of data, along with advanced tools for exploratory data analysis, data mining/machine learning and data visualization, offers a whole new way of understanding the world.
In order to exploit these huge volumes of data, new techniques and technologies are needed. A new type of e-infrastructure, the Research Data Infrastructure, must be developed to harness the accumulation of data and knowledge produced by research communities, optimize data movement across scientific disciplines, enable large increases in multi- and inter-disciplinary science while reducing duplication of effort and resources, and integrate research data with published literature.
Science is a global undertaking and research data are both national and
global assets. A seamless infrastructure is needed to facilitate the collaborative arrangements necessary to address the intellectual and practical challenges the world faces.
The challenges include capturing the user's intention and providing users with suggestions and guidelines to refine their queries so as to converge quickly on the desired results. They also call for novel database architectures and algorithms designed to produce fast and cheap indicative answers rather than complete and correct answers.
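As a rough illustration of what an "indicative answer" can mean in practice, the sketch below estimates an aggregate from a small random sample instead of scanning the whole data set. It is a generic example for this editorial, not a technique proposed by the authors; the data and parameters are made up.

```python
import random

def approximate_count(population, predicate, sample_size=1000, seed=42):
    """Estimate how many items satisfy `predicate` by scanning only a sample.

    Returns (estimate, fraction_scanned); the estimate is fast and cheap,
    but only indicative, not exact.
    """
    random.seed(seed)
    n = len(population)
    sample = random.sample(population, min(sample_size, n))
    hits = sum(1 for item in sample if predicate(item))
    estimate = hits / len(sample) * n
    return estimate, len(sample) / n

# Example: estimate how many of one million readings exceed a threshold.
readings = [random.gauss(0.0, 1.0) for _ in range(1_000_000)]
estimate, fraction = approximate_count(readings, lambda x: x > 2.0)
print(f"~{estimate:.0f} readings above threshold (scanned {fraction:.1%} of the data)")
```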
Please contact:
Costantino Thanos
ISTI-CNR, Italy
E-mail: [email protected]
Invited article
As evidenced by a large and growing number of reports from research communities, research
funding agencies, and academia, there is growing acceptance of the assertion that science is
becoming more and more data-centric.
Data is pushed to the center by the scale and diversity of data from computational models, observatories, sensor networks, and the trails of social engagement in our current age of Internet-based connection. It is pulled to the center by technology and methods now called "the big-data movement" or, by some, a "fourth paradigm for discovery" that enables extracting knowledge from these data and then acting upon it. Vivid examples of data analytics and its potential to convert data to knowledge and then to action in many fields are found at http://www.cra.org/ccc/dan.php. Note that I am using the phrase "big data" to include both the stewardship of the data and the system of facilities and methods (including massive data centers) to extract knowledge from data.

The focus of these comments is the fact that our current infrastructure - technologies, organizations, and sustainability strategies - for the stewardship of digital data and codes is far from adequate to support the vision of transformative data-intensive discovery. I include both data and programming codes because, to the extent that both are critical to research, both need to be curated and preserved to sustain the fundamental tradition of the reproducibility of science. See for example an excellent case study about reproducible research in the digital age at http://stanford.edu/~vcs/papers/RRCiSE-STODDEN2009.pdf.

The power of data mining and analytics increases the opportunity costs of not preserving data for reuse, particularly for inherently longitudinal research such as global climate change. Multi-scale and multi-disciplinary research often requires complex data federation, and in some fields careful vetting and credentialing of data is critical. Appraisal and curation is, at present at least, expensive and labor intensive. Government research-funding agencies are declaring data from research to be a public good and requiring that it be made openly available to the public. But where will these data be stored and stewarded?

On the campuses of research universities there is widespread and growing demand by researchers to create university-level, shared, professionally managed data-storage services and associated services for data management. This is being driven by:
• the general increase in the scale of data and new methods for extracting information and knowledge from data, including data produced by other people
• policies by research funders requiring that data should be an open resource that is available at no or low cost to others over long periods of time
• privacy and export regulations on data that are beyond the capability of the researcher to comply with
• the growing need to situate data and computational resources together to make it easier for researchers to develop scientific applications, increasingly as web services, on top of the rich data store. They could potentially use their own data, as well as shared data and tools from others, to accelerate discovery, democratize resources, and yield more bang for the buck from research funding.

Although there are numerous successful repository services for the scholarly literature, most do not accommodate research data and codes. Furthermore, as noted by leaders of an emerging Digital Preservation Network (DPN) being incubated by Internet2, a US-based research and education network consortium, even the scholarship that is being produced today is at serious risk of being lost forever to future generations. There are many digital collections with a smattering of aggregation, but all are susceptible to multiple single points of failure. DPN aspires to do something about this risk, including archival services for research data.

No research funding agency, at least in the US, has provided or is likely to provide the enormous funding on a sustained basis required to create and maintain an adequate cyberinfrastructure for big data. We must approach it as a shared enterprise involving academia, government, for-profit and non-profit organizations, with multiple institutions playing complementary roles within an incentive system that is sustainable both financially and technically.

At the federal government level, major research funding agencies including the National Science Foundation, the National Institutes of Health, and the Department of Energy, together with several mission-based agencies, are developing, with encouragement from the White House, a coordinated inter-agency response to "big data." Although details will not be available for several months, the goals will be strategic and will include four linked components: foundational research, infrastructure, transformative applications to research, and related training and education.

Commercial partners could play multiple roles in the big-data movement, especially by providing cloud-based platforms for storing and processing big data. The commercial sector has provided, and will likely continue to provide, resources at a scale well beyond what can be provided by an individual or even a university consortium. Major cloud service providers in the US have strategies to build a line of business to provide the cloud as a platform for big data, and there is growing interest within both universities and the federal government in exploring sustainable public-private partnerships.

Initial efforts at collaboration between academia, government, and industry are encouraging, but great challenges remain to nurture the infrastructure necessary to achieve the promise of the big-data movement.

Please contact:
Daniel E. Atkins
University of Michigan, USA
E-mail: [email protected]
SciDB is a native array DBMS that combines data management and mathematical operations
in a single system. It is an open source system that can be downloaded from SciDB.org
SciDB is an open-source DBMS oriented toward the data management needs of scientists. As such it mixes statistical and linear algebra operations with data management ones, using a natural nested multi-dimensional array data model. We have been working on the code for three years, most recently with the help of venture capital backing. Currently, there are 14 full-time professionals working on the code base.

SciDB runs on Linux and manages data that can be spread over multiple nodes in a computer cluster, connected by TCP/IP networking. Data is stored in the Linux file system on local disks connected to each node. Hence, it uses a "shared nothing" software architecture.

The data model supported by SciDB is multi-dimensional arrays, where each cell can contain a vector of values. Moreover, dimensions can be either the standard integer ones or they can be user-defined data types with non-integer values, such as latitude and longitude. There is no requirement that arrays be rectangular; hence SciDB supports "ragged" arrays.

Access is provided through an array version of SQL, which we term AQL. AQL provides facilities for filtering arrays, joining arrays and aggregation over the cell values in an array. Moreover, Postgres-style user-defined scalar functions, as well as array functions, are provided.

In addition, SciDB contains pre-built popular mathematical functions, such as matrix multiply, that operate in parallel on multiple cores on a single node as well as across nodes in a cluster.

Other notable features of SciDB include a no-overwrite storage manager that retains old values of updated data and provides Postgres-style "time travel" on the various versions of a cell. Moreover, we have extended SciDB with support for multiple notions of "null". Using this capability, users can distinguish multiple semantic notions, such as "data is missing but it is supposed to be there" and "data is missing and will be present within 24 hours". Standard ACID transactions are supported, as is an interface to the statistical package R, which can be used to run existing R scripts as well as to visualize the result of SciDB queries.

Our storage manager divides arrays, which can be arbitrarily large, into storage "chunks" which are partitioned across the nodes of a cluster and then allocated to disk blocks. Worthy chunks are also cached in main memory for faster access.
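The chunk-based partitioning just described can be pictured with a toy sketch. The code below is not SciDB's implementation; it is only a minimal Python illustration, with made-up chunk sizes and a simple hash-based placement rule, of how a large dense array might be cut into fixed-size chunks and spread over cluster nodes.

```python
from itertools import product

def chunk_ids(shape, chunk_shape):
    """Enumerate chunk origin coordinates for a dense array of the given shape."""
    ranges = [range(0, dim, c) for dim, c in zip(shape, chunk_shape)]
    return list(product(*ranges))

def place_chunks(shape, chunk_shape, num_nodes):
    """Assign each chunk to a node with a simple hash rule (illustrative only)."""
    placement = {}
    for origin in chunk_ids(shape, chunk_shape):
        placement[origin] = hash(origin) % num_nodes
    return placement

# A hypothetical 2-D array of 1,000,000 x 1,000 cells, cut into 1000 x 1000 chunks
# and spread over a 4-node cluster.
placement = place_chunks(shape=(1_000_000, 1_000), chunk_shape=(1_000, 1_000), num_nodes=4)
print(len(placement), "chunks; chunk (0, 0) stored on node", placement[(0, 0)])
```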
We have benchmarked SciDB against Postgres on an astronomy-style workload that typifies the load provided by the Large Synoptic Survey Telescope (LSST) project. On this benchmark, SciDB outperforms Postgres by two orders of magnitude. We have also benchmarked SciDB analytics against those in R. On a single core, we offer comparable performance; however, SciDB scales linearly with additional cores and additional nodes, a characteristic that does not apply to R.

Early users of SciDB include the LSST project mentioned above and multiple high-energy physics (HEP) projects, as well as commercial applications in genomics, insurance and financial services. SciDB has been downloaded by about 1000 users from a variety of scientific and commercial domains.
Invited article
Owing to the growing interest in digital methods within the humanities, an understanding of the
tenets of digitally based scholarship and the nature of specific data management issues in the
humanities is required. To this end the ESFRI roadmap on European infrastructures has been
seminal in identifying the need for a coordinating e-infrastructure in the humanities - DARIAH -
whose data policy is outlined in this paper.
Scholarly data in the humanities is a heterogeneous notion. Data creation, ie the transcription of a primary document, the annotation of existing sources or the compilation of observations across collections of objects, is inherent to scholarly activity and thus makes it strongly dependent upon the actual hypotheses or theoretical backgrounds of the researcher. There is little notion of a data centre in the humanities since data production and enrichment are anchored on the individuals performing research.

DARIAH's goal is to create a sound and solid infrastructure to ensure the long-term stability of digital assets, as well as the development of a wide range of thus-far unanticipated services to carry out research on these assets. This comprises technical aspects (identification, preservation), editorial aspects (curation, standards) and sociological aspects (openness, scholarly recognition).

This vision is underpinned by the notion of digital surrogates, information structures intended to identify, document or represent a primary source used in a scholarly work. Surrogates can be metadata records, a scanned image of a document, digital photographs, transcription of a textual source, or any kind of extract or transformation (eg the spectral analysis of a recorded speech signal) of existing data. Surrogates act as a stable reference for further scholarly work in replacement of – or in complement to – the original physical source they represent or describe. Moreover, a surrogate can act as a primary source for the creation of further surrogates, thus forming a network that reflects the various steps of the scholarly workflow where sources are combined and enriched before being further disseminated to a wider community.

Such a unified data landscape for humanities research necessitates a clear policy on standards and good practices. Scholars should both benefit from strong initiatives such as the Text Encoding Initiative (TEI) and stabilize their experience by participating in the development of standards, in collaboration with other stakeholders (publishers, cultural heritage institutions, libraries).

The vision also impacts on the technical priorities for DARIAH, namely:
• deploying a repository infrastructure where researchers can transparently and trustfully deposit their productions, comprising permanent identification and access, targeted dissemination (private, restricted and public) and rights management, possibly in a semi-centralized way allowing efficiency, reliability and evolution (cf. http://hal.archives-ouvertes.fr/hal-00399881);
• defining standardized interfaces for accessing data through such repositories, but also through third-party data sources, with facilities such as threading, searching, selecting, visualizing and importing data;
• experimenting with the agile development of virtual research spaces based on such services, integrating community-based research workflows (see http://hal.inria.fr/inria-00593677).

Beyond the technical aspects, an adequate licensing policy must be defined to assert the legal conditions under which data assets can be disseminated. This should be a compromise between making all publicly financed scholarly productions available in open access and preventing the adoption of heterogeneous reuse constraints and/or licensing models. We contemplate encouraging the early dissemination of digital assets in the scholarly process and recommend, when applicable, the use of a Creative Commons CC-BY license, which supports systematic attribution (and thus citation) of the source.

From a political point of view, we need to discuss with potential data providers (cultural heritage entities, libraries or even private sector stakeholders such as Google) methods of creating a seamless data landscape where the following issues should be jointly tackled:
• general reuse agreements for scholars, comprising usage in publications, presentation on web sites, integration (or referencing) in digital editions, etc.;
• definition of standardized formats and APIs that could make access to one or the other data provider more transparent;
• identification of scenarios covering the archival version of records as well as scholarly created enrichments. For example, TEI transcriptions made by scholars could be archived in the library where the primary source is situated.

As a whole, DARIAH should contribute to excellence in research by being seminal in the establishment of a large-coverage, coherent and accessible data space for the humanities. Whether acting at the level of standards, education or core IT services, we should keep this vision in mind when setting priorities in areas that will impact the sustainability of the future digital ecology of scholars.

Links:
ESFRI: ec.europa.eu/research/esfri/
DARIAH: www.dariah.eu/
European report on scientific data: cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf
Text Encoding Initiative: http://www.tei-c.org

Please contact:
Laurent Romary
Inria, France
E-mail: [email protected]
One driver of the data tsunami is social networking companies such as Facebook™, which generate terabytes of content. Facebook, for instance, uploads three billion photos monthly, for a total of 3,600 terabytes annually. The volume of social media is large, but not overwhelming: the data are generated by a great many humans, but each is limited in their rate of data production. In contrast, large scientific facilities are another driver, where the data are generated automatically.
In the 10 years to 2008, the largest current astronomical catalogue, the Sloan Digital Sky Survey, produced 25 terabytes of data from telescopes. By 2014, it is anticipated that the Large Synoptic Survey Telescope will produce 20 terabytes each night. By the year 2019, the Square Kilometre Array radio telescope is planned to produce 50 terabytes of processed data per day, from a raw data rate of 7000 terabytes per second. The designs for systems to manage the data from these next generation scientific facilities are being based on the data management used for the largest current scientific facility: the CERN Large Hadron Collider.

The Worldwide LHC Computing Grid has provided the first global solution to collecting and analyzing petabytes of scientific data. CERN produces data as the Tier0 site, which are distributed to 11 Tier1 sites around the world - including the GRIDPP Tier-1 at STFC's Rutherford Appleton Laboratory (RAL) in the UK. The CASTOR storage infrastructure used at RAL was designed at CERN to meet the challenge of handling the high LHC data rates and volume using commodity hardware. CASTOR efficiently schedules placement of files across multiple storage devices, and is particularly efficient at managing tape access. The scientific metadata relating science to data-files is catalogued by each experiment centrally at CERN. The Tier1 sites operate databases which identify on which disk or tape each data-file is stored.

Figure 1: A snapshot of the monitor showing data analysis jobs being passed around the Worldwide LHC Computing Grid.

In science the priority is to capture the data, because if it's not stored it may be lost, and the lost dataset may have been the one that would have led to a Nobel Prize. Analysis is given secondary priority, since data can be analysed later, when it's possible. Therefore the architecture that meets the user priorities is based on effective storage, with a batch scheduler responsible for choosing compute locations, moving data and scheduling jobs.

The data are made available to researchers who submit jobs to analyse datasets on Tier2 sites. They submit processing jobs to a batch processing scheduler, stating which data to analyse and what analysis to perform. The system schedules jobs for processing at the location that minimises data transfers. The scheduler will copy the data to the compute location before the analysis, but this transfer consumes considerable communication bandwidth, which reduces the response speed.
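To make the trade-off concrete, here is a small hypothetical sketch - not the actual WLCG or CASTOR scheduler - of the kind of decision a locality-aware batch scheduler makes: run the job where most of the input data already resides, and count any remaining bytes as transfer cost. All names and sizes are invented.

```python
def choose_site(job_datasets, site_replicas, dataset_sizes):
    """Pick the site that minimises the bytes to copy before a job can run.

    job_datasets:  datasets the job reads
    site_replicas: site name -> set of datasets already stored there
    dataset_sizes: dataset name -> size in bytes
    """
    best_site, best_cost = None, None
    for site, local in site_replicas.items():
        # Bytes that would have to be transferred to this site first.
        cost = sum(dataset_sizes[d] for d in job_datasets if d not in local)
        if best_cost is None or cost < best_cost:
            best_site, best_cost = site, cost
    return best_site, best_cost

sizes = {"run42": 8_000_000_000, "calib": 500_000_000}
replicas = {"RAL": {"run42"}, "CERN": {"run42", "calib"}, "FNAL": set()}
site, transfer = choose_site({"run42", "calib"}, replicas, sizes)
print(f"schedule at {site}, copying {transfer} bytes first")
```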
The Tier1 at RAL has 8PB of high bandwidth disk storage for frequently used data, in front of a tape robot with lower bandwidth but a maximum capacity of 100PB for long term data archiving. In 2011, network rates between the Tier-1 and the wide area network averaged 4.7Gb/s and peaked at over 20Gb/s; over 17PB of data was moved between the Tier-1 and other sites worldwide. Internally, over the same period, 5PB of data was moved between disk and tape, and a further 28PB between disk and the batch farm. During intense periods of data reprocessing, internal network rates exceeded 60Gb/s for many hours.

Storing and retrieving data is not that difficult - what's hard is managing the data, so that users can find what they want, and get it when they want. The limiting factor for Tier1 sites is the performance of the databases. The RAL database stores a single 20 gigabyte table, representing the hierarchical file structure, which performs about 500 transactions per second across six clusters. In designing the service it is necessary to reduce to a practical level the number of disk operations to the data tables and the log required for error recovery on each cluster. Multiple clusters are used, but that introduces a communications delay between clusters to ensure the integrity of the database, due to the passing of data locking information between them. Either the disk I/O or the inter-cluster communication becomes the limiting factor.

In contrast, Facebook users require immediate interactive responses, so batch schedulers cannot be used. Facebook uses a waterfall architecture which, in February 2011, ran 4,000 instances of MySQL, but also required 9,000 instances of a database memory caching system to speed up performance.

Whatever the application, for large data volumes the problem remains how to model the data and its usage so that the storage system can be appropriately designed to support the users' performance demands.

Please contact:
Michael Wilson, STFC, UK
Tel: +44 1235 446619
E-mail: [email protected]
The ability to explore huge digital resources assembled in data warehouses, databases and files, at
unprecedented speed, is becoming the driver of progress in science. However, existing database
management systems (DBMS) are far from capable of meeting the scientists’ requirements. The
Database Architectures group at CWI in Amsterdam cooperates with astronomers, seismologists
and other domain experts to tackle this challenge by advancing all aspects of database technology.
The group’s research results are disseminated via its open-source database system, MonetDB.
The heart of a scientific data warehouse is its database system, running on a modern distributed platform, and used for both direct interaction with data gathered from experimental devices and management of the derived knowledge using workflow software. However, most (commercial) DBMS offerings cannot fulfill the demanding needs of scientific data management. They fall short in one or more of the following areas: multi-paradigm data models (including support for arrays), transparent data ingestion from, and seamless integration of, scientific file repositories, complex event processing, and provenance. These topics only scratch the surface of the problem. The state of the art in scientific data exploration can be compared with our daily use of search engines. For a large part, search engines rely on guiding the user from their ill-phrased queries through successive refinement to the information of interest. Limited a priori knowledge is required. The sample answers returned provide guidance to drill down, chasing individual links, or to adjust the query terms.

The situation in scientific databases is more cumbersome than searching for text, because they often contain complex observational data, eg telescope images of the sky, satellite images of the earth, time series or seismograms, and little a priori knowledge exists. The prime challenge is to find models that capture the essence of this data at both a macro- and micro-scale. The answer is in the database, but the 'Nobel-winning query' is still unknown.

Next generation database management engines should provide a much richer repertoire and ease of use experience to cope with the deluge of observational data in a resource-limited setting. Good is good enough as an answer, provided the journey can be continued as long as the user remains interested.

We envision seven directions of long term research in database technology:

• Data Vaults. Scientific data is usually available in self-descriptive file formats as produced by advanced scientific instruments. The need to convert these formats into relational tables and to explicitly load all data into the DBMS forms a major hurdle for database-supported scientific data analysis. Instead, we propose the data vault, a database-attached external file repository. The data vault creates a true symbiosis between a DBMS and existing file-based repositories, and thus provides transparent access to all data kept in the repository through the DBMS's (array-based) query language.

• Array support. Scientific data management calls for DBMSs that integrate the genuine scientific data model, multi-dimensional arrays, as first-class citizens next to relational tables, and a unified declarative language as a symbiosis of relational and linear algebra. Such support needs to go beyond 'alien' extensions that provide user defined functions. Rather, arrays need to become first-class DBMS citizens next to relational tables.

• One-minute database kernels. Such a kernel differs from conventional kernels by identifying and avoiding performance degradation by answering queries only partly within strict time bounds. Run the query during a coffee break, look at the result, and continue or abandon the data exploration path (a minimal sketch of this idea follows this list).

• Multi-scale query processing. Fast exploration of large datasets calls for partitioning the database based on science interest and resource availability. It extends traditional partitioning schemes by taking into account the areas of users' interest and the statistical stability in samples drawn from the archives.

• Post-processing result sets. The often huge results returned should not be thrown at the user directly, but passed through an analytical processing pipeline to condense the information for human consumption. This involves computation intensive data mining techniques and harnessing the power of GPUs in the software stack of a DBMS.

• Query morphing. Given the imprecision of the queries, the system should aid in hinting at proximity results using data distributions looked upon during query evaluation. For example, aside from the traditional row set, it may suggest minor changes to the query predicates to obtain non-empty results. The interesting data may be 'just around the corner'.

• Queries as answers. Standing on the shoulders of your peers involves keeping track of the queries, their results, and resource requirements. It can be used as advice to modify ill-phrased queries that could run for hours producing meaningless results.
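The following toy sketch (not MonetDB code) illustrates the one-minute-kernel idea under an assumed interface: scan as much of the data as a time budget allows, return the partial answer together with the fraction of data inspected, and let the user decide whether to continue. The data and budget are invented.

```python
import time

def partial_scan(rows, predicate, time_budget_s=60.0):
    """Evaluate a filter until the time budget expires; return a partial answer.

    Returns (matching_rows, fraction_scanned). fraction_scanned < 1.0 signals
    that the answer is indicative, not complete.
    """
    deadline = time.monotonic() + time_budget_s
    matches, scanned = [], 0
    for row in rows:
        if time.monotonic() > deadline:
            break
        scanned += 1
        if predicate(row):
            matches.append(row)
    return matches, scanned / len(rows) if rows else 1.0

# Hypothetical usage: a "coffee-break" query over synthetic observations.
observations = [{"id": i, "flux": (i * 37) % 1000} for i in range(2_000_000)]
result, coverage = partial_scan(observations, lambda r: r["flux"] > 990, time_budget_s=0.5)
print(f"{len(result)} candidate rows from {coverage:.0%} of the data")
```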
Please contact:
Stefan Manegold
CWI, The Netherlands
E-mail: [email protected]
Modern science disciplines such as environmental science and astronomy must deal with
overwhelming amounts of experimental data. Such data must be processed (cleaned, transformed,
analyzed) in all kinds of ways in order to draw new conclusions and test scientific theories. Despite
their differences, certain features are common to scientific data of all disciplines: massive scale;
manipulated through large, distributed workflows; complexity with uncertainty in the data values, eg,
to reflect data capture or observation; important metadata about experiments and their provenance;
and mostly append-only (with rare updates). Furthermore, modern scientific research is highly
collaborative, involving scientists from different disciplines (eg biologists, soil scientists, and
geologists working on an environmental project), in some cases from different organizations in
different countries. Since each discipline or organization tends to produce and manage its own data
in specific formats, with its own processes, integrating distributed data and processes gets difficult
as the amounts of heterogeneous data grow.
In 2011, to address these challenges, we started Zenith (http://www-sop.inria.fr/teams/zenith/), a joint team between Inria and University Montpellier 2. Zenith is located at LIRMM in Montpellier, a city that enjoys a very strong position in environmental science with major labs and groups working on related topics such as agronomy, biodiversity, water hazard, land dynamics and biology. We are developing our solutions by working closely with scientific application partners such as CIRAD and INRA in agronomy.

Zenith adopts a hybrid P2P/cloud architecture. P2P naturally supports the collaborative nature of scientific applications, with autonomy and decentralized control. Peers can be the participants or organizations involved in collaboration and may share data and applications while keeping full control over some of their data (a major requirement for our application partners). But for very-large scale data analysis or very large workflow activities, cloud computing is appropriate as it can provide virtually infinite computing, storage and networking resources. Such a hybrid architecture also enables the clean integration of the users' own computational resources with different clouds.

Figure 1 illustrates Zenith's architecture with P2P data services and cloud data services. We model an online scientific community as a set of peers and relationships between them. The peers have their own data sources. The relationships are between any two or more peers and indicate how the peers and their data sources are related, eg "friendship", same semantic domain, similar schema. The P2P data services include basic services (metadata and uncertain data management): recommendation, data analysis and workflow management through the Shared-data Overlay Network (SON) middleware. The cloud P2P services include data mining, content-based information retrieval and workflow execution. These services can be accessed through web services, and each peer can use the services of multiple clouds.

Let us illustrate two recent results obtained in this context. The first is the design and implementation of P2Prec (http://www-sop.inria.fr/teams/zenith/p2prec/), a recommendation service for P2P content sharing that exploits users' social data. In our approach, recommendation is based on explicit personalization, by exploiting the scientists' social networks, using gossip protocols that scale well. Relevance measures may be expressed based on similarities, users' confidence, document popularity, rates, etc., and combined to yield different recommendation criteria. With P2Prec, each user can identify data (documents, annotations, datasets, etc.) provided by others and send queries to them. For instance, one may want to know which scientists are expert in a topic and get documents highly rated by them. Or another may look for the best datasets used by others for some experiment. To efficiently disseminate information among peers, we propose new semantic-based gossip protocols. Furthermore, P2Prec has the ability to get reasonable recall with acceptable query processing load and network traffic.
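As a purely illustrative sketch - not the P2Prec implementation or its actual protocol - the code below shows the basic shape of a gossip exchange: each round, every peer contacts one random neighbour and the two merge their views of who is interested in which topics, so profiles spread without any central index. Peer names, topics and scores are made up.

```python
import random

def gossip_round(profiles, neighbours, rng):
    """One gossip round: every peer merges topic profiles with a random neighbour.

    profiles:   peer -> {topic: interest score}
    neighbours: peer -> list of peers it knows
    """
    for peer in profiles:
        if not neighbours[peer]:
            continue
        other = rng.choice(neighbours[peer])
        merged = dict(profiles[peer])
        for topic, score in profiles[other].items():
            merged[topic] = max(merged.get(topic, 0.0), score)
        profiles[peer] = profiles[other] = merged
    return profiles

rng = random.Random(1)
profiles = {"alice": {"soil": 0.9}, "bob": {"biodiversity": 0.8}, "carol": {"soil": 0.4}}
neighbours = {"alice": ["bob"], "bob": ["alice", "carol"], "carol": ["bob"]}
for _ in range(3):
    gossip_round(profiles, neighbours, rng)
print(profiles["carol"])  # carol gradually learns about topics seen by her neighbours
```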
The second result deals with the efficient processing of scientific workflows that are computationally and data-intensive, thus requiring execution on large-scale parallel computers. We propose an algebraic approach (inspired by relational algebra) and a parallel execution model that enable automatic optimization of scientific workflows. With our algebra, data are uniformly represented by relations and workflow activities are mapped to operators that have data-aware semantics. Our execution model is based on the concept of activity activation, which enables transparent distribution and parallelization of activities. Using both a real oil exploitation application and synthetic data scenarios, our experiments demonstrate major performance improvements compared to an ad-hoc workflow implementation.

Please contact:
Patrick Valduriez
Inria, France
E-mail: [email protected]
Process mining provides new ways to analyze the performance of clinical processes based on large
amounts of event data recorded at run-time.
Hospitals and healthcare organizations around the world are collecting increasingly vast amounts of data about their patients and the clinical processes they go through. At the same time - especially in the case of public hospitals - there is growing pressure from governmental bodies to refactor clinical processes in order to improve efficiency and reduce costs. These two trends converge, prompting the need to use run-time data to support the analysis of existing processes.

In the area of process mining, there are specialized techniques for the analysis of business processes according to a number of perspectives, including control-flow, social network, and performance. These techniques are based on the analysis of event data recorded in system logs. In general, any information system that is able to record the activities that are performed during process execution can provide valuable data for process analysis.

These event data become especially relevant for the analysis of clinical processes, which are highly complex, dynamic, multi-disciplinary, and ad-hoc in nature. Until recently one could only prescribe general guidelines for this kind of process, and expect that medical staff comply. Now, with process mining techniques, it is possible to analyze the actual run-time behaviour of such processes and obtain precise information about their performance in near real-time.

Such an endeavour, however, is made difficult by the fact that reality is inherently complex, so direct application of process mining techniques may produce very large and confusing models, which are quite difficult to interpret and analyze – in the parlance of process mining, these are known as "spaghetti" models.

While the area of process mining is being led by Wil van der Aalst at the Eindhoven University of Technology in The Netherlands, here at the Technical University of Lisbon, in Portugal, we have been developing techniques to address the problem of how to extract information from event logs such that the output models are more amenable to interpretation and analysis. To this end, we have spent the last six years developing a number of clustering, partitioning, and preprocessing techniques. Such techniques have matured to the point that they can be systematically applied to real-world event logs, according to a prescribed methodology, to produce understandable, useful, and often surprising results.

One of the latest developments in the field of process mining, introduced by Zhengxing Huang and others at Zhejiang University in China, concerns performance. Typically, a control-flow model must be extracted from the event log prior to performance analysis. However, to study the performance of healthcare processes, only a subset of the recorded activities is usually considered – these are the key activities that represent milestones in the process, and that are always present regardless of the actual path of the patient. The time span between these activities becomes a Key Performance Indicator (KPI).

The ability to measure this KPI directly from the event log is a major improvement with respect to previous performance analysis techniques, which rely on a control-flow model that often includes too much behaviour. Here, we are interested in a predefined sequence of milestones and in retrieving the time span between any pair of milestones. Incidentally, this approach also provides the time span between the first and last activities, which can be used to determine the length of stay (LOS) of the patient in the hospital, one of the most sought-after KPIs in healthcare processes.

Back in Portugal, we applied this approach in a case study carried out in the emergency department of a mid-sized public hospital. The hospital has an Electronic Patient Record (EPR) system, which records the events that take place in several departments. The event log used in this experiment was collected over a period of 12 days. A total of 4851 patients entered the emergency department in that period, resulting in over 30 000 recorded events, although there are only 18 distinct activities.

Figure 1 depicts a control-flow model for these activities, illustrating the reason why such diagrams are often called "spaghetti" models. In Figure 2,
we present the results for some key activities. The first step – triage – determines the priority of the patient and takes place once the patient enters the hospital. For patients who require medical examination, it takes on average two hours to perform the first exam. About two hours and 30 minutes later, the patient receives the diagnosis, and then is quickly discharged, on average within three minutes. The resulting LOS amounts to an average of four hours and 30 minutes.
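The milestone-based KPI described above is easy to state in code. The sketch below is a generic illustration, with a hypothetical event format, and is not the tool used in the case study: it scans an event log, keeps only milestone activities, and reports the average time between a chosen pair of milestones per patient.

```python
from collections import defaultdict
from datetime import datetime

def milestone_kpi(events, start_activity, end_activity):
    """Average time (in minutes) between two milestone activities, per case.

    events: iterable of (case_id, activity, timestamp) with ISO-format timestamps.
    Only the first occurrence of each milestone per case is considered.
    """
    first_seen = defaultdict(dict)  # case_id -> {activity: timestamp}
    for case_id, activity, ts in events:
        if activity in (start_activity, end_activity):
            first_seen[case_id].setdefault(activity, datetime.fromisoformat(ts))
    spans = [
        (acts[end_activity] - acts[start_activity]).total_seconds() / 60.0
        for acts in first_seen.values()
        if start_activity in acts and end_activity in acts
    ]
    return sum(spans) / len(spans) if spans else None

log = [
    ("p1", "triage", "2012-01-10T08:00:00"),
    ("p1", "diagnosis", "2012-01-10T12:20:00"),
    ("p2", "triage", "2012-01-10T09:15:00"),
    ("p2", "diagnosis", "2012-01-10T13:05:00"),
]
print(milestone_kpi(log, "triage", "diagnosis"), "minutes on average")
```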
Link: http://www.processmining.org/
Please contact:
Diogo R. Ferreira
IST – Technical University of Lisbon,
Portugal
Tel.: +351 21 423 35 52
Figure 2: Performance analysis E-mail: [email protected]
With High Throughput Sequencing (HTS) technologies, biology is experiencing a sequence data
deluge. A single sequencing experiment currently yields 100 million short sequences, or reads, the
analysis of which demands efficient and scalable sequence analysis algorithms. Diverse kinds of
applications repeatedly need to query the sequence collection for the occurrence positions of a
subword. Time can be saved by building an index of all subwords present in the sequences before
performing huge numbers of queries. However, both the scalability and the memory requirement of
the chosen data structure must suit the data volume. Here, we introduce a novel indexing data
structure, called Gk arrays, and related algorithms that improve on classical indexes and state of the
art hash tables.
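The abstract above gives no implementation detail, so the following is only a naive baseline for the kind of query such an index must answer; it is not the Gk arrays structure itself. It builds a plain hash table from every k-length subword to the list of (read, position) pairs where it occurs, which works but uses far more memory than the compact structures the abstract refers to.

```python
from collections import defaultdict

def build_kmer_index(reads, k):
    """Map every k-length subword to the (read index, offset) positions where it occurs."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].append((read_id, pos))
    return index

reads = ["ACGTACGT", "TTACGTAA"]
index = build_kmer_index(reads, k=4)
print(index["ACGT"])  # -> [(0, 0), (0, 4), (1, 2)]
```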
Joint genetic and neuroimaging data analysis on large cohorts of subjects is a new approach used to
assess and understand the variability that exists between individuals. This approach, which to date
is poorly understood, has the potential to open pioneering directions in biology and medicine. As
both neuroimaging- and genetic-domain observations include a huge number of variables (of the order of 10^6), performing statistically rigorous analyses on such Big Data represents a
computational challenge that cannot be addressed with conventional computational techniques. In
the A-Brain project, researchers from Inria and Microsoft Research explore cloud computing
techniques to address the above computational challenge.
Several brain diseases have a genetic origin, or their occurrence and severity is related to genetic factors. Genetics plays an important role in understanding and predicting responses to treatment for brain diseases like autism, Huntington's disease and many others. Brain images are now used to understand, model, and quantify various characteristics of the brain. Since they contain useful markers that relate genetics to clinical behaviour and diseases, they are used as an intermediate between the two. Currently, large-scale studies assess the relationships between diseases and genes, typically involving several hundred patients per study.

Imaging genetic studies linking functional MRI data and Single Nucleotide Polymorphisms (SNPs) data may face a dire multiple comparisons issue. In the genome dimension, genotyping DNA chips allow recording of several hundred thousand values per subject, while in the imaging dimension an fMRI volume may contain 100k-1M voxels. Finding the brain and genome regions that may be involved in this link entails a huge number of hypotheses, hence a drastic correction of the statistical significance of pair-wise relationships, which in turn results in a crucial reduction of the sensitivity of the statistical procedures that aim to detect the association. It is therefore desirable to set up techniques that are as sensitive as possible to explore where in the brain and where in the genome a significant link can be detected, while correcting for family-wise multiple comparisons (controlling the false positive rate).

In the A-Brain project, researchers of the Parietal and KerData Inria teams jointly address this computational problem using cloud computing techniques on the Microsoft Azure cloud computing environment. The two teams bring their complementary expertise: KerData (Rennes) in the area of scalable cloud data management and Parietal (Saclay) in the field of neuroimaging and genetics data analysis.

The Map-Reduce programming model has recently arisen as a very effective approach to develop high-performance applications over very large distributed systems such as grids and now clouds. KerData has recently proposed a set of algorithms for data management, combining versioning with decentralized metadata management to support scalable, efficient, fine-grain access to massive, distributed Binary Large OBjects (BLOBs) under heavy concurrency. The project investigates the benefits of integrating BlobSeer with Microsoft Azure storage services and aims to evaluate the impact of using BlobSeer on Azure with large-scale application experiments such as the genetics-neuroimaging data comparisons addressed by Parietal. The project is supervised by the Joint Inria-Microsoft Research Centre.

Sophisticated techniques are required to perform sensitive analysis on the targeted large datasets.
Figure 1: Identifying areas in the human brain (red and orange colors) in which activation is correlated with a given SNP, using A-Brain.
Univariate studies find an SNP and a neuroimaging trait that are significantly correlated (eg the amount of functional activity in a brain region is related to the presence of a minor allele on a gene). In regression studies, some sets of SNPs predict a neuroimaging/behavioural trait (eg a set of SNPs predicts a given brain characteristic), while with multivariate studies, an ensemble of genetic traits predicts a certain combination of neuroimaging traits. Typically, the data sets involved contain 50K voxels and 500K SNPs. Additionally, in order to obtain results with a high degree of confidence, 10K permutations of the initial data are required, resulting in a total computation of 2.5 × 10^14 associations. Several regressions are performed, each giving a set of correlations, and all these intermediate data must be stored in order to compare the values of each simulation and keep that which is most significant. The intermediate data that must be stored can easily reach 1.77 PetaBytes.
Traditional computing has shown its limitations in offering a solution for such a complex problem in the context of Big Data. Performing one experiment to determine whether there is a correlation between one brain location and any of the genes would take about five years on a single core. The computational framework, however, can easily be run in parallel, and with the emergence of the recent cloud platforms we could perform such computations in a reasonable time (days).

Our goal is to use Microsoft's Azure cloud to perform such experiments. For this purpose, two million hours per year and 10 TBytes of storage on the Azure platform are available for the duration of the project (three years). In order to execute the complex A-Brain application one needs a parallel programming framework (like MapReduce), supported by a high performance storage backend. We therefore developed TomusBlobs, an optimized storage service for Azure clouds, leveraging the high throughput under heavy concurrency provided by the BlobSeer library developed at KerData. TomusBlobs is a distributed storage system that exposes the local storage of the computation nodes in the cloud as a uniform shared storage to the application. Using this system as a storage backend, we implemented TomusMapReduce, a MapReduce platform for Azure. With these tools we were able to execute the neuroimaging and genetics application in Azure and to create a demo for it. Preliminary results show that our solution brings substantial benefits to data intensive applications like A-Brain compared to approaches relying on state-of-the-art cloud object storage.

The next step will be to design a performance model for the data management layer, which considers the cloud's variability and provides some optimized deployment configurations. We are also investigating new techniques to make more efficient correlations between genes and brain characteristics.

Links:
http://www.msr-inria.inria.fr/Projects/a-brain
http://www.irisa.fr/kerdata/
http://parietal.saclay.inria.fr/
http://blobseer.gforge.inria.fr/

Please contact:
Gabriel Antoniu, Inria, France
Tel: +33 2 99 84 72 44, E-mail: [email protected]

Alexandru Costan, Inria, France
Tel: +33 2 99 84 25 34, E-mail: [email protected]

Bertrand Thirion, Inria, France
Tel: +33 1 69 08 79 92, E-mail: [email protected]
For decades, compute power and storage have become steadily cheaper, while network speeds,
although increasing, have not kept up. The result is that data is becoming increasingly local and thus
distributed in nature. It has become necessary to move the software and hardware to where the
data resides, and not the reverse. The goal of LAWA is to create a Virtual Web Observatory based on
the rich centralized Web repository of the European Archive. The observatory will enable Web-scale
analysis of data, will facilitate large-scale studies of Internet content and will add a new dimension
to the roadmap of Future Internet Research – it’s about time!
trustworthiness or personally sensitive social content. The VWO will provide graphical browsing to aid the user in discovering information along with its relationships. Hyperlinks constitute a type of relationship that, when visualized, provides insight into the connection of terms or documents and is useful for determining their information source, quality and trustworthiness. Our main goal is to present relevant interconnections of a large graph with textual information, by visualizing a well-connected small subgraph fitting the information needs of the user.

A Web archive of timestamped versions of Web sites over a long-term time horizon opens up great opportunities for analysts. By detecting named entities in Web pages we raise the entire analytics to a semantic rather than keyword level. Difficulties here arise from name ambiguities, thus requiring a disambiguation mapping of mentions (noun phrases in the text that can denote one or more entities) onto entities. For example, "Bill Clinton" might be the former US president William Jefferson Clinton, or any other William Clinton contained in Wikipedia. Ambiguity further increases if the text only contains "Clinton" or a phrase like "the US president". The temporal dimension may further introduce complexity, for example when names of entities have changed over time (eg people getting married or divorced, or organizations that undergo restructuring in their identities).

As part of our research on entity disambiguation we have developed the AIDA system (Accurate Online Disambiguation of Named Entities), which includes an efficient and accurate NED method suited for online usage. Our approach leverages the YAGO2 knowledge base as an entity catalog and a rich source of relationships among entities. We cast the joint mapping into a graph problem: mentions from the input text and candidate entities define the node set, and we consider weighted edges between mentions and entities, capturing context similarities, and weighted edges among entities, capturing coherence.
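To make the graph formulation concrete, the following toy sketch builds and scores such a mention-entity graph; the mentions, candidate entities and all edge weights are invented for illustration and are not AIDA's actual similarity or coherence functions:

from itertools import product

# Toy mention-entity graph in the spirit of the description above: mentions and
# candidate entities are the nodes; mention-entity edges carry context
# similarity, entity-entity edges carry coherence.
mentions = ["Clinton", "Gore"]
candidates = {
    "Clinton": ["Bill_Clinton", "George_Clinton"],
    "Gore": ["Al_Gore", "Gore_Vidal"],
}
similarity = {                       # mention -> entity edges
    ("Clinton", "Bill_Clinton"): 0.5,
    ("Clinton", "George_Clinton"): 0.5,
    ("Gore", "Al_Gore"): 0.5,
    ("Gore", "Gore_Vidal"): 0.5,
}
coherence = {                        # entity <-> entity edges
    frozenset({"Bill_Clinton", "Al_Gore"}): 0.9,
    frozenset({"George_Clinton", "Gore_Vidal"}): 0.1,
}

def joint_score(assignment):
    """Score one joint mapping of every mention to one candidate entity."""
    score = sum(similarity[(m, e)] for m, e in assignment.items())
    chosen = list(assignment.values())
    for i in range(len(chosen)):
        for j in range(i + 1, len(chosen)):
            score += coherence.get(frozenset({chosen[i], chosen[j]}), 0.0)
    return score

best = max(
    (dict(zip(mentions, combo))
     for combo in product(*(candidates[m] for m in mentions))),
    key=joint_score,
)
print(best)  # the jointly most coherent assignment wins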
Figure 2 shows the user interface of AIDA. The left panel allows the user to select the underlying disambiguation methods and to insert an input text. The panel in the middle shows, for each mention (in green), the disambiguated entity, linked with the corresponding Wikipedia articles. In addition, a clickable type cloud allows the user to explore the types of the named entities contained. Finally, the rightmost panel provides statistics about the disambiguation process.

In conclusion, supercomputing software architectures will play a key role in scaling data- and computation-intensive problems in business intelligence, information retrieval and machine learning. In the future we will be identifying new applications such as Web 3.0, and considering the combination of distributed and many-core computing for problems that are both data and computationally intensive.

Our work is supported by the 7th Framework IST programme of the European Union through the focused research project (STREP) on Longitudinal Analytics of Web Archive data (LAWA) under contract no. 258105.

Links:
LAWA project website: http://www.lawa-project.eu/
AIDA web interface: https://d5gate.ag5.mpi-sb.mpg.de/webaida/
YAGO2 web interface: https://d5gate.ag5.mpi-sb.mpg.de/webyagospotlx/WebInterface
Visualization demo: http://dms.sztaki.hu/en/letoltes/wimmut-searching-and-navigating-wikipedia
FIRE website: http://cordis.europa.eu/fp7/ict/fire/

Please contact:
Marc Spaniol
Max-Planck-Institute for Informatics, Saarbrücken, Germany
E-mail: [email protected]

András Benczúr
SZTAKI, Budapest, Hungary
E-mail: [email protected]
Vision Cloud is an ongoing European project on cloud computing. The novel storage and
computational infrastructure is designed to meet the challenge of providing for tomorrow’s
data-intensive services.
The two most important trends in information technology today are the increasing proliferation of data-intensive services and the digital convergence of telecommunications, media and ICT. A growing number of services aggregate, analyze and stream rich data to service consumers over the Internet. We see the inevitability of media, telecommunications and ICT services becoming merged into one operating platform, where content and ICT resources (including storage, network, and computing) are integrated to provide value-added services to users. The growing number, scale, variety and sophistication of data-intensive services impose demanding requirements that cannot be met using today's computing technologies. There is the need to simultaneously service millions of users, accommodate the rapid growth of services and the sheer volume of data, while at the same time providing high availability, low maintenance costs and efficient resource utilization.

The two critical ingredients to deliver converged data-intensive services are the storage infrastructure and the computational infrastructure. Not only must the storage offer unprecedented scalability and good, tunable availability, it must also provide the means to structure, categorize, and search massive data sets. The computational framework […]
The massive amount of digital data currently being produced by industry, commerce and research is
an invaluable source of knowledge for business and science, but its management requires scalable
storage and computing facilities. In this scenario, efficient data analysis tools are vital. Cloud systems
can be effectively exploited for this purpose as they provide scalable storage and processing services,
together with software platforms for developing and running data analysis environments. We present
a framework that enables the execution of large-scale parameter sweeping data mining applications
on top of computing and storage services.
The past two decades have been characterized by an exponential growth of digital data production in many fields of human activity, from science to enterprise. In the biological, medical, astronomic and earth science fields, for example, very large data sets are produced daily from the observation or simulation of complex phenomena. Unfortunately, massive data sets are hard to understand, and models and patterns hidden within them cannot be identified by humans directly, but must be analyzed by computers using knowledge discovery in databases (KDD) processes and data mining techniques.

Data analysis applications often need to run a data mining task several times, using different parameter values, before obtaining significant results. For this reason, parameter sweeping is widely used in data mining applications to explore the effects of different parameter values on the results of data analysis. This is a time-consuming process when a single computer is used to mine massive data sets, since it can require very long execution times.

Cloud systems can be effectively employed to handle this class of application since they provide scalable storage and processing services, as well as software platforms for developing and running data analysis environments on top of such services.

We have worked on this topic by developing Data Mining Cloud App, a software framework that enables the execution of large-scale parameter sweeping data analysis applications on top of Cloud computing and storage services. The framework has been implemented using Windows Azure and has been used to run large-scale parameter sweeping data mining applications on a Microsoft Cloud data centre.

Figure 1 shows the architecture of the Data Mining Cloud App framework, as it is implemented on Windows Azure. The framework includes the following components:
• a set of binary and text data containers (Azure blobs) used to store the data to be mined (input datasets) and the results of data mining tasks (data mining models)
• a task queue that contains the data mining tasks to be executed
• a task status table that keeps information about the status of all tasks
• a pool of k workers, where k is the number of virtual servers available, in charge of executing the data mining tasks submitted by the users
• a website that allows users to submit, monitor the execution of, and access the results of data mining tasks.
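The division of labour between the task queue, the status table and the worker pool can be sketched as follows. This is a self-contained illustration using in-memory Python structures, not the framework's actual code; in the real system these components are hosted on Azure (with datasets and models kept in Azure blobs):

import queue
import threading
import time

task_queue = queue.Queue()   # data mining tasks waiting to be executed
task_status = {}             # task id -> submitted / running / done / failed

def submit(task_id, dataset, algorithm, params):
    task_status[task_id] = "submitted"
    task_queue.put((task_id, dataset, algorithm, params))

def worker():
    while True:
        task_id, dataset, algorithm, params = task_queue.get()
        task_status[task_id] = "running"
        try:
            algorithm(dataset, **params)   # run one data mining task
            task_status[task_id] = "done"
        except Exception:
            task_status[task_id] = "failed"
        finally:
            task_queue.task_done()

k = 4                                      # one worker per virtual server
for _ in range(k):
    threading.Thread(target=worker, daemon=True).start()

def dummy_classifier(dataset, depth):      # stand-in for a real mining algorithm
    time.sleep(0.1)

# Parameter sweeping: submit the same task once per parameter value.
for i, depth in enumerate([2, 4, 8, 16]):
    submit(f"task-{i}", "input-dataset", dummy_classifier, {"depth": depth})

task_queue.join()
print(task_status)                         # e.g. {'task-0': 'done', ...}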
The website includes three main sections: i) task submission, which allows users to submit data mining tasks; ii) task status, which is used to monitor the status of submitted tasks and to access results; iii) data management, which allows users to manage input data and results. The user can monitor the status of each single task through the task status section of the website, as shown in Figure 3. For each task, the current status (submitted, running, done or failed) and the status update time are shown. Moreover, for each task that has completed its execution, the system enables two links: the first (Stat) gives access to a file containing some statistics about the amount of resources consumed by the task; the second (Result) visualizes the task result.

We evaluated the performance of the Data Mining Cloud App through the execution of a set of long-running parameter sweeping data mining applications on a pool of virtual servers hosted by a Microsoft Cloud data centre. The experiments demonstrated the effectiveness of the Data Mining Cloud App framework, as well as the scalability that can be achieved through the parallel execution of parameter sweeping data mining applications on a pool of virtual servers. For example, the classification of a large dataset (290,000 records) on a single virtual server required more than 41 hours, whereas it was completed in less than three hours on 16 virtual servers. This corresponds to an execution speedup equal to 14.

Other than supporting users in designing and running parameter sweeping data mining applications on large data sets, we intend to exploit Cloud computing platforms for running knowledge discovery processes designed as a combination of several data analysis steps to be run in parallel on Cloud computing elements. To achieve this goal, we are currently extending the Data Mining Cloud App framework to also support workflow-based KDD applications, in which complex data analysis applications are specified as graphs that link together data sources, data mining algorithms, and visualization tools.

Links:
http://www.microsoft.com/windowsazure
http://grid.deis.unical.it

Please contact:
Domenico Talia
ICAR-CNR and DEIS, University of Calabria, Italy
Tel: +39 0984 494726
E-mail: [email protected]

Fabrizio Marozzo and Paolo Trunfio
DEIS, University of Calabria, Italy
E-mail: [email protected], [email protected]
In today’s highly networked world, any researcher can study massive amounts of source code
even on inexpensive off-the-shelf hardware. This leads to opportunities for new analyses and
tools. The analysis of big software data can confirm the existence of conjectured phenomena,
expose patterns in the way a technology is used, and drive programming language research.
The amount and variety of available external information associated with evolving software systems is staggering: data sources include bug reports, mailing list archives, issue trackers, dynamic traces, navigation information extracted from the IDE, and meta-annotations from the versioning system. All these sources of information have a time dimension, which is tracked in versioning control systems.

Software systems, however, do not exist in isolation but co-exist in larger contexts known as software ecosystems. A software ecosystem is a group of software systems that is developed and co-evolves together in the same environment. The usual environments in which ecosystems exist are organizations (companies, research centres, universities) or communities (open source communities, programming language communities). The systems within an ecosystem usually co-evolve, depend on each other, have intersecting sets of developers as authors, and use similar technologies and libraries. Analyzing an entire ecosystem entails dealing with orders of magnitude more data than analyzing a single system. As a result, analysis techniques that work for the individual system no longer apply.

Recently, we have seen the emergence of a new type of large repository of information associated with software systems, which can be orders of magnitude larger than an ecosystem: the super-repository. Super-repositories are repositories of project repositories. The existence of super-repositories provides us with an even larger source of information to analyze, exceeding ecosystems again by orders of magnitude.

Figure 2: A subset of the software systems in the SqueakSource ecosystem shows a tight network of compile-time dependencies.

Link:
http://scg.unibe.ch/bigsoftwaredata

Please contact:
Mircea Lungu, Oscar Nierstrasz, Niko Schwarz
University of Bern, Switzerland
E-mail: [email protected], [email protected], [email protected]
The potential of Semantic Big Data is currently severely underexploited due to their huge space
requirements, the powerful resources required to process them and their lengthy consumption time.
We work on novel compression techniques for scalable storage, exchange, indexing and query
answering of such emerging data.
The Web of Data materializes the basic principles of the Semantic Web. It is a collective effort for the integration and combination of data from diverse sources, allowing automatic machine-processing and reuse of information. Data providers make use of a common language to describe and share their semi-structured data, hence one person, or a machine, can move through different datasets with information about the same resource. This common language is the Resource Description Framework (RDF), and Linked Open Data is the project that encourages the open publication of interconnected datasets in this format.

[…] compress these plain formats using universal compression algorithms. The resultant file lacks logical structure and there is no agreed way to efficiently publish such data, ie to make them (publicly) available for diverse purposes and users. In addition, the data are hardly usable at consumption; the consumer has to decompress the file and, then, to use an appropriate external […]

[…]-centric view. HDT modularizes the data and uses the skewed structure of big RDF graphs to achieve large spatial savings. In addition, it includes metadata describing the RDF dataset, which serves as an entrance point to the information on the dataset and leads to clean, easy-to-share publications. Our experiments show that big RDF is now exchanged and processed 10-15 times […]
[…] query evaluation and index construction.

The HDT format has recently been accepted as a W3C Member Submission, highlighting the relevancy of 'efficient interchange of RDF graphs'. We are currently working on our HDT-based store. In this sense, compression is not only useful for exchanging big data, but also for distribution purposes, as it allows bigger amounts of data to be managed using fewer computational resources. Another research area where we plan to apply these concepts is sensor networks, where data throughput plays a major role. In the near future, where RDF exchange together with SPARQL query resolution will be the most common daily task of Web machine agents, our efforts will serve to alleviate current scalability drawbacks.
Links:
DataWeb Research Group: http://dataweb.infor.uva.es
RDF/HDT: http://www.rdfhdt.org
HDT W3C Member Submission: http://www.w3.org/Submission/2011/03/

Please contact:
Javier D. Fernández
University of Valladolid, Spain
E-mail: [email protected]

Miguel A. Martínez-Prieto
University of Valladolid, Spain
E-mail: [email protected]

Mario Arias
DERI, National University of Ireland Galway, Ireland
E-mail: [email protected]
The digital collections of scientific and memory institutions – many of which are already in the petabyte
range – are growing larger every day. The fact that the volume of archived digital content worldwide is
increasing geometrically demands that the associated preservation activities become more scalable. The
economics of long-term storage and access demand that they become more automated. The present state
of the art fails to address the need for scalable automated solutions for tasks like the characterization or
migration of very large volumes of digital content. Standard tools break down when faced with very large
or complex digital objects; standard workflows break down when faced with a very large number of objects
or heterogeneous collections. In short, digital preservation is becoming an application area of big data, and
big data is itself revealing a number of significant preservation challenges.
The EU FP7 ICT Project SCAPE (Scalable Preservation Environments), running since February 2011, was initiated in order to address these challenges through intensive computation combined with scalable monitoring and control. In particular, data analysis and scientific workflow management play an important role. Technical development is carried out in three sub-projects and will be validated in three Testbeds (refer to Figure 1).

Figure 1: Challenges of the SCAPE project

Testbeds
The SCAPE Testbeds will examine very large collections from three different application areas: Digital Repositories from the library community (including nearly two petabytes of broadcast audio and video archives from the State Library of Denmark, who are adding more than 100 terabytes every year), Web Content from the web archiving community (including over a petabyte of web harvest data), and Research Data Sets from the scientific community (including millions of objects from the UK Science and Technology Facilities Council's Diamond Synchrotron source and ISIS suite of neutron and muon instruments). The Testbeds provide a description of preservation issues, with a special focus on data sets that imply a real challenge for scalable solutions. They range from single activities (like data format migration, analysis and identification) to complex quality assurance workflows. These are supported at scale by solutions like the SCAPE Platform or the Planning and Monitoring services detailed below. SCAPE solutions will be evaluated against defined institutional data sets in order to validate their applicability in real-life application scenarios, such as large-scale data ingest, analysis and maintenance.

Data Analysis and Preservation Platform
The SCAPE Platform will provide an extensible infrastructure for the execution of digital preservation workflows on large volumes of data. It is designed as an integrated system for content holders, employing a scalable architecture to execute preservation processes on archived content. The system is based on distributed and data-centric systems like Hadoop and HDFS and programming models like MapReduce. A suitable storage abstraction will provide integration with content repositories at the storage level for fast data exchange between the repository and the execution system. Many SCAPE collections consist of large binary objects that must be pre-processed before they can be expressed using a structured data model. The Platform will therefore implement a storage hierarchy for processing, analysing and archiving content, relying on a combination of distributed database and file […]
Research in the field of information retrieval is often concerned with improving the quality of search systems. The quality of a search system crucially depends on ranking the documents that match a query. To get the best documents ranked in the top results for a query, search engines use numerous statistics on query terms and documents, such as the number of occurrences of a term in the document, the number of occurrences of a term in the collection, the number of hyperlinks pointing at a document, the number of occurrences of a term in the anchor texts of hyperlinks pointing at a document, etc. New ranking ideas are tested off-line on query sets with human-rated documents. If such ideas are radically new, experimentally testing them might require a non-trivial amount of coding to change an existing search engine. If, for instance, a new idea requires information that is not currently in the search engine's inverted index, then the researcher has to re-index the data or even recode parts of the system's indexing facilities, and possibly recode the query processing facilities that access this information. If the new idea requires query processing techniques that are not supported by the search engine (for instance sliding windows, phrases, or structured query expansion) even more work has to be done.

Instead of using the indexing facilities of the search engine, we propose to use MapReduce to test new retrieval approaches by sequentially scanning all documents. Some of the advantages of this method are: 1) Researchers spend less time on coding and debugging new experimental retrieval approaches; 2) It is easy to include new information in the ranking algorithm, even if that information would not normally be included in the search engine's inverted index; 3) Researchers are able to oversee all or most of the code used in the experiment; 4) Large-scale experiments can be done in reasonable time.
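A minimal sketch of such a sequential-scan experiment, written in Hadoop-streaming style (a mapper and a reducer reading text from standard input), is shown below; the query set and the term-count scoring function are deliberately naive placeholders, not methods proposed in the article:

import sys
from collections import Counter

QUERIES = {"q1": ["big", "data"]}   # toy query set (an assumption of this sketch)
TOP_K = 10

def mapper(lines):
    # Input: one document per line, formatted as "doc_id <TAB> document text".
    # Output: (query_id, score, doc_id) for every matching document scanned.
    for line in lines:
        doc_id, _, text = line.rstrip("\n").partition("\t")
        tf = Counter(text.lower().split())
        for qid, terms in QUERIES.items():
            score = sum(tf[t] for t in terms)   # naive term-count ranking
            if score > 0:
                yield qid, score, doc_id

def reducer(records):
    # Keep only the TOP_K highest-scoring documents per query.
    best = {}
    for qid, score, doc_id in records:
        best.setdefault(qid, []).append((score, doc_id))
    for qid, scored in best.items():
        for score, doc_id in sorted(scored, reverse=True)[:TOP_K]:
            yield f"{qid}\t{doc_id}\t{score}"

if __name__ == "__main__":
    # Run locally by chaining the two functions; under Hadoop streaming the
    # mapper and reducer would run as separate programs over the collection.
    for out in reducer(mapper(sys.stdin)):
        print(out)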
MapReduce was developed at Google as a framework for batch processing of large data sets on clusters of commodity machines. Users of the framework implement a mapper function that processes a key/value pair to generate a set of intermediate key/value pairs, and […]
To date, big data applications have focused on the store-and-process paradigm. In this paper we
describe an initiative to deal with big data applications for continuous streams of events.
In many emerging applications, the volume of data being streamed is so large that the traditional 'store-then-process' paradigm is either not suitable or too inefficient. Moreover, soft real-time requirements might severely limit the engineering solutions. Many scenarios fit this description. In network security for cloud data centres, for instance, very high volumes of IP packets and events from sensors at firewalls, network switches, routers and servers need to be analyzed, and attacks should be detected in minimal time in order to limit the effect of the malicious activity on the IT infrastructure. Similarly, in the fraud department of a credit card company, payment requests should be processed online, as quickly as possible, in order to provide meaningful results in real-time. An ideal system would detect fraud during the authorization process, which lasts hundreds of milliseconds, and deny the payment authorization, minimizing the damage to the user and the credit card company.
[…] events and results are output any time the actual data satisfies the query predicate. A continuous query is modelled as a graph where edges identify data flows and nodes represent operators that process input data.
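A toy version of such an operator graph, with Python generators standing in for stream operators, might look as follows; the query, the window size and the event data are all invented for illustration:

# A continuous query as a tiny graph of operators: a source feeds a filter,
# which feeds a tumbling-window counter; a result is emitted whenever a
# full window of matching events has been observed.
def source(events):
    for e in events:
        yield e

def filter_op(stream, predicate):
    for e in stream:
        if predicate(e):
            yield e

def window_count(stream, size):
    window = []
    for e in stream:
        window.append(e)
        if len(window) == size:
            yield len(window), window
            window = []

events = [{"card": "A", "status": "denied"}] * 7 + [{"card": "B", "status": "ok"}]
pipeline = window_count(
    filter_op(source(events), lambda e: e["status"] == "denied"),
    size=5,
)
for count, _ in pipeline:
    print("alert: burst of", count, "denied authorizations")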
Centralized CEPs suffered from single-node bottlenecks and were quickly replaced by distributed CEPs, where the query was distributed across several nodes in order to decrease the per-node tuple processing time and increase the overall throughput. Nevertheless, each […]

[…] amount of computing resources to the actual workload and achieve cost-effectiveness. Indeed, any parallel system with a static number of processing nodes might experience under-provisioning (ie the overall computing power is not enough to handle the input load) or over-provisioning (ie the current load is lower than the system maximum throughput and some nodes are running below their capacity).

With those goals in mind, we are developing StreamCloud, a parallel-distributed […]

Link:
http://www.massif-project.eu

Please contact:
Vicenzo Gulisano, Ricardo Jimenez-Peris, Marta Patiño-Martinez, Claudio Soriente
Universidad Politécnica de Madrid, Spain
Tel: +34 913367452
E-mail: [email protected], [email protected], [email protected], [email protected]

Patrick Valduriez
Inria, LIRMM, France
Tel: +33 467149726
One of the main challenges facing next generation Cloud platform services is the need to
simultaneously achieve ease of programming, consistency, and high scalability. Big Data
applications have so far focused on batch processing. The next step for Big Data is to move to the
online world. This shift will raise the requirements for transactional guarantees. CumuloNimbo is a
new EC-funded project led by Universidad Politécnica de Madrid (UPM) that addresses these issues
via a highly scalable multi-tier transactional platform as a service (PaaS) that bridges the gap
between OLTP and Big Data applications.
CumuloNimbo aims at architecting and developing an ultra-scalable transactional Cloud platform as a service (PaaS). The current state of the art in transactional PaaS is to scale by resorting to sharding, or horizontal partitioning of data across database servers, sacrificing consistency and ease of programming. Sharding destroys transactional semantics since it is applied to only subsets of the overall data set. Additionally, it forces modifications to applications and/or requires rebuilding them from scratch, and in most cases also changing the business rules to adapt to the shortcomings of current technologies. Thus it becomes imperative to address these issues by providing an easily programmable platform with the same consistency levels as current service-oriented platforms.

The CumuloNimbo PaaS addresses these challenges by providing support for familiar programming interfaces such as Java Enterprise Edition (EE), SQL, as well as NoSQL data stores, ensuring seamless portability across a wide range of application domains. Simultaneously, the platform is designed to support Internet-scale Cloud services (hundreds of nodes providing service to millions of clients) in terms of both data processing and storage capacity. These challenges require careful consideration of architectural issues at multiple tiers, from the application and transactional model all the way to scalable communication and storage.

CumuloNimbo improves the scalability of transactional systems, enabling them to process update transaction rates in the range of one million update transactions per second in a fully transparent way. This transparency is both syntactic and semantic. Syntactic transparency means that existing applications will be able to run totally unmodified on top of CumuloNimbo and benefit automatically from the underlying ultra-scalability, elasticity and high availability. Semantic transparency means that applications will continue to work exactly as they did on centralized infrastructure, with exactly the same semantics and preserving the same coherence they had. The full transparency will remove one of the most important obstacles to the migration of applications to the cloud, ie the need to heavily modify, or even fully rebuild, them.

CumuloNimbo adopts a novel approach for providing SQL processing. Its main breakthrough lies in the scalability of transactional management, which is achieved by decomposing the different functions required for transactional processing and scaling each of them separately in a composable manner (refer to […]).
ConPaaS makes it easy to write scalable Cloud applications without worrying about the complexity
of the Cloud.
[…] the Cloud takes about 10 minutes. Increasing the processing capacity of the application requires two mouse clicks.

One of ConPaaS's Big Data use cases is a bioinformatics application that analyses large datasets across distributed computers. It uses large amounts of data from a Chip-Seq analysis, a type of genomic analysis methodology, and an application that can be parallelized in order to make use of multiple instances or processors to analyse data faster. The application stores its data in XtreemFS and makes extensive use of ConPaaS's MapReduce, TaskFarming, and Web hosting services. Users will use the application either directly through an API or through a web interface.

Although ConPaaS is already sufficiently mature to support challenging applications, we have many plans for further developments. In the near future, instead of manually choosing the number of resources each service should use, a user will be able to specify the performance she expects. ConPaaS will dimension each service such that the system meets its performance guarantees, while using the smallest possible number of computing resources. In the wiki example, for instance, one may want to request that user requests are processed on average in no more than 500 milliseconds.

We plan to allow users to upload complex applications in a single operation. Instead of starting and configuring multiple ConPaaS services one by one, a user will be able to upload a single manifest file describing the entire application organization. Thanks to this manifest, ConPaaS will be able to orchestrate the deployment and configuration of entire applications automatically.

Finally, we plan to provide an SDK for external users to implement their own services. For example, one could write a new service for demanding statistical analysis, for video streaming, or for scientific workflows. The platform will allow third-party developers to upload their own service as a plugin to the existing ConPaaS system.

In conclusion, ConPaaS is a runtime environment for Cloud applications. It takes care of the complexity of Cloud environments, letting application developers focus on what they do best: program great applications to satisfy their customers' needs.

Links:
http://www.conpaas.eu/
http://contrail-project.eu/
http://scalaris.googlecode.com/
http://xtreemfs.org/

Please contact:
Guillaume Pierre
Vrije Universiteit in Amsterdam, The Netherlands
E-mail: [email protected]

Thorsten Schuett
Zuse Institute in Berlin, Germany
E-mail: [email protected]
Can we have information in advance on organized crime movements? How can fraud and corruption be
fought? Can cybercrime threats be tackled in a safe way? The Crime and Corruption Observatory of the
European Project FuturICT will work at answering these questions. Starting from Big Data, it will face big
challenges, and will propose new ways to analyse and understand social phenomena.
The cost of crime in the United States is estimated to be more than $1 trillion annually. The cost of corruption ranges from 2 to 5 per cent of global Gross Domestic Product (GDP), ie from $800 billion to $2 trillion US dollars. The cost of the war on terrorism in the US since 9/11 is over $1 trillion.

These are just a few examples of the huge impact of crime on social, legal and economic systems. The situation in Europe is similarly dramatic. To face the problem in a new way, the "Crime and Corruption Observatory" is being set up in order to develop new technology to study and predict the evolution of phenomena that threaten the security of our society. The Observatory aims at building a data infrastructure to support crime prevention and reduce the costs of crime.

The starting point is Big Data: the huge amount of digital information now available enables the development of virtual techno-socio-economic models from existing and new information technology systems. The Observatory will collect huge data sets and run massive data mining and large-scale computer simulations of social dynamics related to criminal activities. It will be built using innovative technological instruments. This approach requires an internationally recognized, scientifically grounded strategy, able to embrace different national policies, since global threats such as crime and corruption require global answers.

The Observatory will be built as a European network, with a central node probably in Italy. A large number of important European universities and research institutions will cooperate to develop the necessary components. Scientists from many different fields - from cognitive and social science to criminology, from artificial intelligence to complexity science, from statistics to economics and psychology - will be involved, establishing a pool of varied expertise. The method will thus be strongly interdisciplinary: this is the only way to promote a real paradigm shift in our approach to policy and decision-making. On the one hand, the way policies are designed can be enhanced through innovative "what-if" analyses developed by complex models and simulations. On the other hand, the goal is to create new tools to support police and security agencies and services with more effective instruments for law enforcement.
Long-established technological platforms are no longer able to address the data and processing
requirements of the emerging data-intensive scientific paradigm. At the same time, modern
distributed computational platforms are not yet capable of addressing the global, elastic, and
networked needs of the scientific communities producing and exploiting huge quantities and
varieties of data. A novel approach, the Hybrid Data Infrastructure, integrates several technologies,
including Grid and Cloud, and promises to offer the necessary management and usage capabilities
required to implement the ‘Big Data’ enabled scientific paradigm.
A recent study, promoted by The Royal Society of London in cooperation with Elsevier, reviewed the changing patterns of science, highlighting that science is increasingly a global, multidisciplinary and networked effort performed by scientists that dynamically collaborate to achieve specific objectives. The same study also indicated that data-intensive science is gaining momentum in many domains. Large-scale datasets come in all forms and shapes, from huge international experiments to cross-laboratory, single laboratory, or even a multitude of individual observations.

The management and processing of such datasets is beyond the capacity of traditional technological approaches based on local, specialized data facilities. They require innovative solutions able to simultaneously address the needs imposed by multidisciplinary collaborations and by the new data-intensive pattern. These needs are characterized by the well known three V's: (i) Volume – data dimension in terms of bytes is huge, (ii) Velocity – data collection, processing and consumption is demanding in terms of speed, and (iii) Variety – data heterogeneity, in terms of data types and data sources requiring integration, is high.

Recent approaches, such as Grid and Cloud Computing, can only partially satisfy these needs. Grid Computing was initially conceived as a technological platform to overcome the limitations in volume and velocity of single laboratories by sharing and re-using computational and storage resources across laboratories. It offers a valid solution in specific scientific domains such as High Energy Physics.
However, Grid Computing does not handle 'variety' well. It supports a very limited set of data types and needs a common software middleware, dedicated hardware resources and a costly infrastructure management regulated by rigid policies and procedures.

Cloud Computing, instead, provides an elastic usage of resources that are maintained by third-party providers. It is based on the assumption that the management of hardware and middleware can be centralized, while the applications remain in the hands of the consumer. This considerably reduces application maintenance and operational costs. However, as it is a technology based on an agreement between the resource provider and the consumer, it is not suitable to manage the integration of resources deployed and maintained by diverse distributed organizations.

The Hybrid Data Infrastructure (HDI) is a new, more effective solution for managing the new types of scientific dataset. It assumes that several technologies, including Grid, private and public Cloud, can be integrated to provide an elastic access and usage of data and data-management capabilities.

The gCube software system, whose technological development has been coordinated by ISTI-CNR, implements the HDI approach. It was initially conceived to manage distributed computing infrastructures. It has evolved to operate large-scale HDIs, enabling a data-management-capability-delivery model in which computing, storage, data and software are made accessible by the infrastructure and are exploited by users using a thin client (namely a web browser), through dedicated on-demand Virtual Research Environments.

gCube operates a large federation of computational and storage resources by relying on a rich and open array of mediator services for interfacing with Grid (eg European Grid Infrastructure), commercial cloud (eg Microsoft Azure, Amazon EC2), and private cloud (eg OpenNebula) infrastructures. Relational databases, geospatial storage systems (eg GeoServer), NoSQL databases (eg Cassandra, MongoDB), and reliable distributed computing platforms (eg Hadoop) can all be exploited as infrastructural resources. This guarantees a technological solution suitable for the volume, velocity, and variety of the new science patterns.

gCube is much more than a software integration platform; it is also equipped with software frameworks for data management (access and storage, integration, curation, discovery, manipulation, mining, and visualization) and workflow definition and execution. These frameworks offer traditional data management facilities in an innovative way by taking advantage of the plethora of integrated and seamlessly accessible technologies. The supported data types cover a wide spectrum, ranging from tabular data – eg observations, statistics, records – to research products – eg diagrams, maps, species distribution models, enhanced publications. Datasets are associated with rich metadata and provenance information which facilitate effective re-use. These software frameworks can be configured to implement different policies, ranging from the enforcement of privacy via encryption and secure access control to the promotion of data sharing while guaranteeing provenance and attribution.

The infrastructure enabled by gCube is now exploited to serve scientists operating in different domains, such as biologists generating model-based large-scale predictions of natural occurrences of species and statisticians managing and integrating statistical data.

gCube is a collaborative effort of several research, academic and industrial centres including ISTI-CNR (IT), University of Athens (GR), University of Basel (CH), Engineering Ingegneria Informatica SpA (IT), University of Strathclyde (UK), and CERN (CH). Its development has been partially supported by the following European projects: DILIGENT (FP6-2003-IST-2), D4Science (FP7-INFRA-2007-1.2.2), D4Science-II (FP7-INFRA-2008-1.2.2), iMarine (FP7-INFRASTRUCTURES-2011-2), and EUBrazilOpenBio (FP7-ICT-2011-EU-Brazil).

Link:
gCube website: http://www.gcube-system.org

Please contact:
Pasquale Pagano, ISTI-CNR, Italy
E-mail: [email protected]
A fundamental and emerging need with big amounts of data is data exploration: when we are
searching for interesting patterns we often do not have a priori knowledge of exactly what we are
looking for. Database cracking enables such data exploration features by bringing, for the first time,
incremental and adaptive indexing abilities to modern database systems.
Figure 1: Vision of an ultra-dense assembly of chip stacks in a 3D computer with fluidic heat removal and power delivery.
the required electrical energy to the chips. The pins dedicated to power supply in a chip package easily outnumber the pins dedicated to signal I/O in high-performance microprocessors, and the number of power pins has been growing faster than the total number of pins for all processor types. This power problem is essentially a wiring problem, which is aggravated by the fact that wiring for signal I/O has a problem of its own. The energy needed to switch a 1 mm interconnect is more than an order of magnitude larger than the energy needed to switch one MOSFET; 25 years ago, the switching energies were roughly equal. This disparate evolution of communication and computation is even more pronounced for another vital performance metric, latency. The reason behind this trend is simply that transistors have become smaller while the chip size has not followed suit, leading to substantially longer total wire lengths. A proposed new solution to this two-fold wiring problem, again patterned after the mammalian brain, is to use the coolant fluid as the means of delivering energy to the chips. Probably it is easiest to think of this in terms of a kind of electrochemical, distributed battery where the cooling electrolyte is constantly "recharged" while the heat is removed. This is how energy is delivered to the mammalian brain, and with great effectiveness.

Using the example of the human brain as the best low-power density computer, we use technological analogs for sugar as fuel and the trans-membrane proton pumps which evolution has brought about as the chemical-to-electrical energy converters. These analogs are inorganic redox couples which have mainly been studied for grid-scale energy storage in the form of redox flow batteries. These artificial electrochemical systems offer superior power densities compared with their biological counterparts, but would still be pressed to meet the weighty challenge of satisfying the energy need of a fully loaded microprocessor. However, future high-performance computers which could be built around this fluidic power delivery scheme would be much less power-intensive due to their reduced communication cost.

Integration density and communication
The number of communication links between logic blocks in a microprocessor depends on the complexity of the interconnect architecture and on the number of logic blocks in a way that can be described by a power law known as Rent's Rule. Today, all microprocessors suffer from a break-down of Rent's Rule for high logic block counts because the number of interconnects does not scale smoothly beyond the chip edge. The limited number of package pins is one of the main reasons behind the performance limitation faced by modern computing systems known as the memory wall, in which it takes several hundred to a thousand CPU clock cycles to fetch data from main memory. A dense, three-dimensional physical arrangement of semiconductor chips would allow much shorter communication paths and a corresponding reduction in internal-delay roadblocks. Such a dense packaging of chips would be physically possible were it not for the problems of heat removal and energy delivery using today's architecture. Using the techniques just described for handling electrical energy delivery and heat removal, this dense packaging can be achieved and communication bottlenecks with associated delay avoided.
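Rent's Rule expresses the number of external terminals T of a block with g logic gates as a power law, T = t·g^p. The snippet below simply evaluates this law for illustrative parameter values (t = 2.5 and p = 0.6 are typical textbook figures, not values taken from the article):

# Terminal count predicted by Rent's Rule, T = t * g**p, for growing block
# sizes; the required pin count quickly outgrows what a package edge offers.
t, p = 2.5, 0.6                     # illustrative Rent coefficient and exponent
for g in (1e3, 1e6, 1e9):
    print(f"{g:>13,.0f} gates -> {t * g ** p:,.0f} terminals")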
The fluidic means of removing heat allows huge increases in packaging density, while the fluidic delivery of power with the same medium saves the space used by conventional hard-wired energy delivery. The savings in space allow much denser architectures, not to mention the sharply reduced energy needs and improved latency as communication paths are chopped up into much shorter fragments.

Conclusion
Using this new paradigm, the hope is that a petaflop supercomputer could eventually be built in the space taken by today's desktop PC. This is only a factor of eight smaller in performance than the above-mentioned fastest computer in the world today! The second reference below describes the paradigm in detail.

Links:
Gerhard Ingmar Meijer, Thomas Brunschwiler, Stephan Paredes, and Bruno Michel, "Using Waste Heat from Data Centres to Minimize Carbon Dioxide Emission", ERCIM News, No. 79, October 2009:
http://ercim-news.ercim.eu/en79/special/using-waste-heat-from-data-centres-to-minimize-carbon-dioxide-emission

P. Ruch, T. Brunschwiler, W. Escher, S. Paredes, and B. Michel, "Toward 5-Dimensional Scaling: How Density Improves Efficiency in Future Computers", IBM J. Res. Develop., Vol. 55 (5), 15:1-15:13, October 2011:
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=6044603

Please contact:
Bruno Michel, IBM Zurich Research Laboratory
E-mail: [email protected]
[…] high voltage is applied to a viscous solution on a sharp conducting tip, causing it to form a Taylor cone. As the electric field is increased, a fluid jet is extracted from the Taylor cone and accelerated toward a grounded collecting substrate. Nanofibers (having at least one dimension of 100 nanometers (nm) or less) exhibit special properties mainly due to their extremely high surface-to-weight ratio. Within this main nanoICT challenge, we are now working on simulating possible improvements of DNA-functionalized nanofibers to be used as smart nanobiosensors, through a combination of selective DNA chemical bonding with the nanofiber surface.

Please contact:
Mario D'Acunto, ISM-CNR, Italy
E-mail: [email protected]

Ovidio Salvetti or Antonio Benassi, ISTI-CNR, Italy
E-mail: [email protected], [email protected]

In order to extract data from such documents, for purposes such as information extraction, it is necessary to consider their internal representation structures as well as the spatial relationships between presented elements. Typical problems that must be addressed, especially in the case of PDF documents, are incurred by the separation between document structure and spatial layout. Layout is important as it often indicates the semantics of data items corresponding to complex structures that are conceptually difficult to query; eg in western languages, the meaning of a cell entry in a table is most easily defined by the leftmost cell of the same row and the topmost cell of the same column. Even when the internal encoding provides fine-grained annotation, the conceptual gap between the low-level representation of PODs and the semantics of the elements is extremely wide. This makes it difficult:
• for humans and applications attempting to manipulate POD content. For example, languages such as XPath 1.0 are currently not applicable to PDF documents;
• for machines attempting to learn extraction rules automatically. In fact, existing wrapper induction approaches infer the regularity of the structure of PODs only by analyzing their internal structure.

The effectiveness of manual and automated wrapper construction is thus limited by the need to analyze the internal encoding of PODs with increasing structural complexity. The intrinsic print/visual oriented nature of PDF encoding poses many issues in defining 'ad hoc' information extraction approaches.
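The row/column convention mentioned above (the meaning of a cell is given by the leftmost cell of its row and the topmost cell of its column) can be made concrete with a small sketch; the table content is invented purely for illustration:

# Turn a simple grid into (row header, column header, value) triples,
# following the leftmost-cell / topmost-cell convention described above.
table = [
    ["Country", "Population", "Capital"],
    ["Italy",   "59M",        "Rome"],
    ["Greece",  "10M",        "Athens"],
]

def cell_semantics(grid):
    col_headers = grid[0][1:]
    for row in grid[1:]:
        row_header, values = row[0], row[1:]
        for col_header, value in zip(col_headers, values):
            yield (row_header, col_header, value)

for triple in cell_semantics(table):
    print(triple)   # e.g. ('Italy', 'Population', '59M')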
A CNR spin-off and start-up company, Altilia srl, will implement the approaches defined at ICAR-CNR. Altilia will provide semantic content capture technologies for the content management area of the IT market.

Link: http://www.altiliagroup.com

Please contact: Massimo Ruffolo, ICAR-CNR, Italy
E-mail: [email protected]
Information and communications technologies promise to have a significant impact on safety at sea. This is particularly true for smaller ships and boats that rarely have active on-board safety systems. We are currently developing a system for computer-aided maritime search and rescue operations within the ICT-E3 Project (ICT excellence programme of Western Sicily funded by the Sicilian Regional Government).

A successful conclusion of a maritime search and rescue (SAR) operation depends on several factors. Some, like the weather and sea conditions, are uncontrollable; others can be optimized and made more effective by the employment of information and communications technologies. A system able to localize a vessel in trouble and to define the most efficient plan for search and rescue activities is of great importance for safety at sea.

The first step in building such a system is an accurate localization of vessels in trouble. Normally, the last known position (LKP) of a vessel is communicated to the rescuers by the people on board. This apparently simple action can become extremely difficult with adverse weather and sea conditions. To add to the difficulty, those on board may only know their positions approximately or, at worst, not at all. The vessel may not be equipped with a global positioning system or even a suitable compass to obtain at least some bearings. Thus the localization provided by the people on board a vessel in trouble may generate imprecise or useless information.

SAROPs are regulated at the international level by a set of standard procedures defined and described in the IAMSAR volume II.

Figure 1: Upper side: a simple scenario with two stations detecting an emergency call. Lower side: datum, search area and probability distribution obtained by the Monte Carlo simulation.
The use of a computer system implementing all the procedures involved in SAROPs can reduce errors and the time needed to define a SAR plan; it can also improve the probability of success of the rescue mission.

IAMSAR procedures were originally developed for manual calculation and they do not include support for a computer implementation. Hence, they avoid the adoption of complex and effective algorithms for defining the search action plan; such algorithms are a viable solution when the support of a computer is available. For the same reason, the IAMSAR manual only suggests two simple search paths for the navigation of SAR units; furthermore, the handling and the allocation of these units over the search area is quite rigid.

Starting from these considerations, we have developed an enhanced implementation of the IAMSAR procedures and have integrated it with the automatic localization system outlined above. An enhanced statistical processing method (the Monte Carlo simulation technique) has been introduced in this implementation. It determines the search area instead of the prefixed probability maps suggested by the IAMSAR manual. Crucial data, like wind force, state of the sea and water current models from SOAP messages, can also be integrated. The system includes a database in which all SAR units at the disposal of a Coast Guard Station are stored with details of their salient features. A friendly graphic user interface has been designed, where several graphic layers of information can be overlaid or hidden in visualization.

Other significant and innovative extensions of the IAMSAR procedures have been carried out during the development of the ICT-E3 project, and are now subject to a patent application. We wish to acknowledge the collaboration with the personnel of the Mazara del Vallo Coast Guard Station; their invaluable help and stimulating suggestions have been fundamental to the success of the activity.

Link: http://ceur-ws.org/Vol-621/paper21.pdf

Please contact:
Massimo Cossentino, Carmelo Lodato, Salvatore Lopes, Umberto Maniscalco
ICAR-CNR, Italy
E-mail: [email protected], [email protected], [email protected], [email protected]

Salvatore Aronica, IAMC-CNR, Italy
E-mail: [email protected]
Wikipedia as Text

by Máté Pataki, Miklós Vajna and Attila Csaba Marosi

When seeking information on the Web, Wikipedia is an essential source: its English version features nearly four million articles. Studies show that it is the most frequently plagiarized information source, so when KOPI, a new translational plagiarism checker, was created, it was necessary to find a way to add this vast source of information to the database. As it is impossible to download the whole database in an easy-to-handle format, like HTML or plain text, and all the available Mediawiki converters have some flaws, a Mediawiki XML dump to plain text converter has been written, which runs every time a new database dump appears on the site, with the text version being published for everybody to use.

[…] of XML dump, or the output was error prone and in many cases – when using, for example, a Mediawiki instance as a converter – the conversion was slow.

Consequently, a new converter needed to be written based on the following features:
• article boundaries have to be kept
• only the textual information is necessary
• infoboxes – as they are duplicated information – are filtered out
• comments, templates and math tags are dismissed
• other pieces of information, like tables, are converted to text.

Wikipedia dumps are published regularly and, as the aim is to be up-to-date, the system needed an algorithm which is able to process the whole English Wikipedia in a fast and reliable way. As the text is also subject to a couple of language processing steps to facilitate plagiarism search, these steps were included and the whole processing moved to the SZTAKI Desktop Grid service (operated by the Laboratory of Parallel and Distributed Systems), where more than 40,000 users have donated their free computational resources to scientific and social issues. Desktop grids are usually suited for parameter study or "bag-of-tasks" types of applications and have other minor requirements for the applications in exchange for the large amount of "free" computing resources made available through them. SZTAKI Desktop Grid, established in 2005, utilizes mostly volatile and non-dedicated resources (usually the donated computation time of desktop computers) to solve compute-intensive tasks from different scientific domains like mathematics and physics. Donors run a lightweight client in the background which downloads and executes tasks. The client also makes sure that only the excess resources are utilized, so there is no slowdown for the computer and the donor is not affected in any other way.

The Mediawiki converter was written in PHP to support easy development and compatibility with the existing codebase of the KOPI Portal. The main functionality could be implemented with less than 400 lines of code. The result was adapted to the requirements of the desktop grid with the help of GenWrapper, a framework specially created for porting existing scientific applications to desktop grids.
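As a rough illustration of the kind of filtering implied by the feature list above, a few of the rules can be expressed as regular-expression passes over the wikitext of a single article. The sketch is in Python for brevity (the actual converter is written in PHP, as noted above) and only approximates the real rules:

import re

def wikitext_to_text(wikitext):
    """Very rough wikitext cleaner following the feature list above."""
    text = re.sub(r"<!--.*?-->", "", wikitext, flags=re.S)        # drop comments
    text = re.sub(r"<math>.*?</math>", "", text, flags=re.S)      # drop math tags
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                    # drop simple templates/infoboxes
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text) # keep link labels only
    text = re.sub(r"\{\|.*?\|\}",                                 # flatten tables crudely
                  lambda m: m.group(0).replace("|", " "), text, flags=re.S)
    text = re.sub(r"'{2,}", "", text)                             # drop bold/italic markup
    return re.sub(r"[ \t]+", " ", text).strip()

sample = "{{Infobox person|name=Ada}}'''Ada Lovelace''' was a [[mathematician|mathematician]].<!--stub-->"
print(wikitext_to_text(sample))   # -> Ada Lovelace was a mathematician.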
Systems and control science has played an important enabling role in all major technological evolutions, from the steam engine to rockets, high-performance aircraft, space ships, high-speed trains, 'green' cars, digital cameras, smart phones, modern production technology, medical equipment, and many others. It provides a large body of theory that enables the analysis of dynamic systems in order to better understand their behaviour, improve their design, and augment them with advanced information processing, leading to qualitative leaps in performance. Over the last fifty years the field of systems and control has seen huge advances, leveraging technology improvements in sensing and computation with breakthroughs in the underlying principles and mathematics. Motivated by this record of success, control technologists are addressing contemporary challenges as well, examples of which include:

• The automotive industry is focusing on active safety technologies, which may ultimately lead to partially autonomous driving, where humans become passengers of automated vehicles governed by automatic control algorithms for substantial parts of their trips, leading to improved safety, better fuel economy, and better utilization of the available infrastructure.

• Automatic control will help improve surgery. Robots are already used to support surgeons, minimizing the invasiveness of procedures and increasing the accuracy of operations. It is conceivable that semi-autonomous robots, remotely supervised by surgeons, will be capable of carrying out unprecedentedly complex operations.

• Automatic control will play a fundamental role in the energy landscape of the future, both in the efficient use of energy from various sources in industry and in buildings, and in the management of the generation, distribution and consumption of electrical energy with increased use of renewable and decentralized generation. The management of important schedulable loads (e.g. the recharging of electric cars) and of distributed sources (e.g. at customers' homes) calls for completely new large-scale control structures; a toy sketch of such load-shaping control follows this list.
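The position paper itself contains no code; purely as a toy illustration of the load-shaping idea mentioned in the last bullet above, the following Python snippet applies a simple proportional rule that steers aggregate EV charging into the valleys of a fictitious daily base-load profile, keeping total demand near a target level. The profile, the gain and all figures are invented for illustration.

```python
# Toy illustration (our own sketch, not from the position paper): a proportional
# rule that shifts aggregate EV charging into the valleys of a daily base load,
# keeping total demand near a target level. All numbers are invented.
import numpy as np

HOURS = 24
hours = np.arange(HOURS)
base_load = 60 + 25 * np.sin(2 * np.pi * hours / HOURS)   # MW, fictitious daily profile
energy_needed = 120.0                                      # MWh of EV charging to deliver
target = base_load.mean() + energy_needed / HOURS          # flat total demand we aim for
k_p = 0.8                                                  # proportional gain

charge = np.zeros(HOURS)
remaining = energy_needed
for t in range(HOURS):
    headroom = max(target - base_load[t], 0.0)   # spare capacity below the target
    charge[t] = min(k_p * headroom, remaining)   # charge in proportion to the headroom
    remaining -= charge[t]

uniform_peak = base_load.max() + energy_needed / HOURS     # naive "charge every hour" baseline
controlled_peak = (base_load + charge).max()
print(f"peak demand with uniform charging:    {uniform_peak:.1f} MW")
print(f"peak demand with controlled charging: {controlled_peak:.1f} MW, "
      f"{charge.sum():.1f} of {energy_needed:.0f} MWh delivered")
```

Real load-shaping controllers must of course handle forecasts, constraints and thousands of distributed actuators, which is precisely why the position paper calls for new large-scale control structures.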
The Systems and Control position paper outlines ten crucial areas in which control will make a strong impact in the next decade: ground and air smart traffic management; green electricity and the Smart Grid; improved energy efficiency in production systems; security in decentralized automation; mechatronics and control co-design and automation; analysis, control and adaptation of large infrastructures; autonomous systems; neurosciences; health care (from open medication to closed-loop control); and cellular and biomolecular research.

The paper then summarizes the main overarching challenges behind these applications: 1) system-wide coordination control of large-scale systems; 2) distributed networked control systems; 3) autonomy, cognition and control; 4) model-based systems engineering; and 5) human-machine interaction. This is followed by a discussion of new sectors where control will have a major role to play: control and health; control and social and economic phenomena and markets; and control and quantum engineering.

Finally, some operational recommendations are listed in order to provide the means to develop this extremely important scientific and technological discipline, whose critical role in ICT is essential for meeting European policy objectives in the future.

Link:
Systems and Control position paper, 30 pages:
http://www.hycon2.eu/extrafiles/CONTROL_position_paper_FP8_28_10_2011.pdf

Please contact:
Sebastian Engell
TU Dortmund, Germany
E-mail: [email protected]

Françoise Lamnabhi-Lagarrigue
CNRS, France
E-mail: [email protected]
First ‘NetWordS’ Workshop on Understanding the Architecture of the Mental Lexicon: Integration of Existing Approaches

by Claudia Marzi

A workshop, held 24-26 November 2011 at the CNR Research Campus in Pisa, was organised within the framework of “NetWordS”, the European Science Foundation Research Networking Programme on the Structure of Words in the Languages of Europe.

The ambitious goal of the workshop was to lay the foundations for an interdisciplinary European research agenda on the Mental Lexicon for the coming 10 years, with particular emphasis on three main challenges:
• Lexicon and rules in the grammar
• Word knowledge and word use
• Words and meanings

Leading scholars were invited to address three basic questions:
• In the speaker’s area of expertise, what are the most pressing open issues concerning the architecture of the Mental Lexicon?
• What and how can progress in other research areas contribute to addressing these issues?
• How can advances in our understanding of these issues contribute to progress in other areas?

The workshop brought together 37 participants (scholars, post-docs and PhD students) from a number of European countries. Eighteen speakers, from diverse scientific domains, presented cross-disciplinary approaches to the understanding of the architecture of the Mental Lexicon, reflecting the interdisciplinarity and synergy fostered by NetWordS. Contributions were devoted to understanding the ontogenesis of word competence, creative usage of words in daily conversation, and the architecture of the mental lexicon and its brain substrates. In all these research areas NetWordS intends to encourage multidisciplinarily informed integration and synthesis of existing approaches.

More than 40 research institutions from 16 European countries participate in NetWordS. Scientists involved in NetWordS are playing a leading role in the following areas:
• Theoretical issues in morphology and its interfaces
• Typological, variationist and historical aspects of word structure
• Cognitive issues in lexical architecture
• Short-term and long-term memory issues
• Neuro-physiological correlates of lexical organization and processing
• Psycho-linguistic evidence on lexical organization and processing
• Machine-learning approaches to morphology induction
• Psycho-computational models of the mental lexicon
• Distributional semantics

NetWordS promotes the development of interdisciplinary transnational scientific partnerships through short-visit grants, which are assigned yearly on the basis of open calls for short-term project proposals. Scholars taking part in interdisciplinary activities funded through NetWordS grants convene periodically to discuss and disseminate results. Short-visit grants are also geared towards planning focused collaborative work, with a view to catalysing credible large-scale proposals within more application-oriented European projects and other initiatives.

NetWordS organises yearly workshops on interdisciplinary issues in word structure, usually between late November and early December. A major conference is planned to take place in 2015.

NetWordS is pleased to announce the first Summer School on Interdisciplinary Approaches to Exploring the Mental Lexicon, 2nd-6th July 2012, Dubrovnik (Croatia). The school offers a broad and intensive range of interdisciplinary courses on methodological and topical issues related to the architecture of the mental lexicon, its levels of organisation, content and functioning, and a series of keynote lectures on recent advances in this area. The school targets doctoral students and junior researchers from fields as diverse as Cognition, Computer Science, Brain Sciences and Linguistics, with a strong motivation to advance their awareness of theoretical, typological, psycholinguistic, computational and neuro-physiological aspects of word structure and processing.

Link: http://www.networds-esf.eu

Please contact:
Claudia Marzi, ILC-CNR, Italy
E-mail: [email protected], [email protected], [email protected]


IEEE Winter School on Speech and Audio Processing organized and hosted by FORTH-ICS

by Eleni Orphanoudakis

The Signal Processing Laboratory of the Institute of Computer Science (ICS) of the Foundation for Research and Technology – Hellas (FORTH) organized the international Winter School on Speech and Audio Processing for Immersive Environments and Future Interfaces of the IEEE Signal Processing Society, which took place from 16-20 January at FORTH, in Heraklion, Crete, Greece.

The Winter School involved a series of lectures from distinguished researchers from all over the world and was focused on current research trends and applications in the areas of audio and speech signal processing. More specifically, current trends were presented in areas such as automatic speech recognition, speech synthesis, speech and audio modeling and coding, speech capture in noisy environments using one or more microphones, and 3D audio rendering using two or more loudspeakers. The Winter School also involved “hands-on” sessions, where students were able to work on practical aspects of speech and audio signal processing, as well as “demo” sessions where researchers from
Book Announcement: Francesco Flammini (Editor)
Czech Research Consortium for Informatics and Mathematics
FI MU, Botanicka 68a, CZ-602 00 Brno, Czech Republic
http://www.utia.cas.cz/CRCIM/home.html

Fonds National de la Recherche
6, rue Antoine de Saint-Exupéry, B.P. 1777, L-1017 Luxembourg-Kirchberg
http://www.fnr.lu/

FORTH – Foundation for Research and Technology – Hellas
Institute of Computer Science
P.O. Box 1385, GR-71110 Heraklion, Crete, Greece
http://www.ics.forth.gr/

Fraunhofer ICT Group
Friedrichstr. 60, 10117 Berlin, Germany
http://www.iuk.fraunhofer.de/

Polish Research Consortium for Informatics and Mathematics
Wydział Matematyki, Informatyki i Mechaniki, Uniwersytetu Warszawskiego, ul. Banacha 2, 02-097 Warszawa, Poland
http://www.plercim.pl/

Spanish Research Consortium for Informatics and Mathematics
D3301, Facultad de Informática, Universidad Politécnica de Madrid, Campus de Montegancedo s/n, 28660 Boadilla del Monte, Madrid, Spain
http://www.sparcim.es/

Swiss Association for Research in Information Technology
c/o Professor Daniel Thalmann, EPFL-VRlab, CH-1015 Lausanne, Switzerland
http://www.sarit.ch/

Magyar Tudományos Akadémia
Számítástechnikai és Automatizálási Kutató Intézet
P.O. Box 63, H-1518 Budapest, Hungary
http://www.sztaki.hu/
You can also subscribe to ERCIM News and order back copies by filling out the form on the ERCIM News website: http://ercim-news.ercim.eu/