The EuroLab4HPC Long-Term Vision on High-Performance Computing
2nd Edition, January 2020
Funded by the European Union Horizon 2020 Framework Programme (H2020-EU.1.2.2. - FET Proactive)
Overall Editors and Authors
Authors
Compiled by
We also acknowledge the numerous people who provided valuable feedback at the roadmapping workshops at HiPEAC CSW and the HPC Summit. We thank HiPEAC and EXDCI for hosting the workshops and Xavier Salazar for the organizational support.
Executive Summary
Radical changes in computing are foreseen for the next decade. The US IEEE society wants to "reboot computing" and the HiPEAC Visions of 2017 and 2019 see the time to "re-invent computing", both by challenging its basic assumptions. This document presents the second edition of the "EuroLab4HPC Long-Term Vision on High-Performance Computing" of January 2020 (https://www.eurolab4hpc.eu/vision/), a road-mapping effort within the EC CSA (European Commission Community and Support Action) EuroLab4HPC that targets potential changes in hardware, software, and applications in High-Performance Computing (HPC).

The objective of the EuroLab4HPC Vision is to provide a long-term roadmap from 2023 to 2030 for High-Performance Computing (HPC). Because of the long-term perspective and its speculative nature, the authors started with an assessment of future computing technologies that could influence HPC hardware and software. The proposal on research topics is derived from the report and from discussions within the road-mapping expert group. We prefer the term "vision" over "roadmap", firstly because timings are hard to predict given the long-term perspective, and secondly because EuroLab4HPC will have no direct control over the realization of its vision.

The Big Picture

High-performance computing (HPC) typically targets scientific and engineering simulations with numerical programs mostly based on floating-point computations. We expect the scaling of such scientific and engineering applications to continue well beyond Exascale computers.

However, three trends are changing the landscape for high-performance computing and supercomputers. The first trend is the emergence of data analytics complementing simulation in scientific discovery. While simulation remains a major pillar for science, massive volumes of scientific data are now gathered by sensors, augmenting the data from simulation available for analysis. High-Performance Data Analysis (HPDA) will complement simulation in future HPC applications.

The second trend is the emergence of cloud computing and warehouse-scale computers (also known as data centres). Data centres consist of low-cost volume processing, networking and storage servers, aiming at cost-effective data manipulation at unprecedented scales. The scale at which they host and manipulate (e.g., personal, business) data has led to fundamental breakthroughs in data analytics.

There are myriad challenges facing massive data analytics, including management of highly distributed data sources, tracking of data provenance, data validation, mitigating sampling bias and heterogeneity, data format diversity and integrity, integration, security, privacy, sharing, visualization, and massively parallel and distributed algorithms for incremental and/or real-time analysis.

Large data centres are fundamentally different from traditional supercomputers in their design, operation and software structures. In particular, big data applications in data centres and cloud computing centres require different algorithms and differ significantly from traditional HPC applications, such that they may not require the same computer structures.

With modern HPC platforms being increasingly built from volume servers (i.e. one server = one role), a number of features are shared among warehouse-scale computers and modern HPC platforms, including dynamic resource allocation and management, high utilization, parallelization and acceleration, robustness and infrastructure costs. These shared concerns will serve as incentives for the convergence of the two platforms.

There are, meanwhile, a number of ways in which traditional HPC systems differ from modern warehouse-scale computers: efficient virtualization, adverse network topologies and fabrics in cloud platforms, and low memory and storage bandwidth in volume servers.
HPC customers must adapt to co-exist with cloud services; warehouse-scale computer operators must innovate technologies to support the workload and platform at the intersection of commercial and scientific computing.

It is unclear whether a convergence of HPC with big data applications will arise. Investigating hardware and software structures targeting such a convergence is of high research and commercial interest. However, some HPC applications will be executed more economically on data centres. Exascale and post-Exascale supercomputers could become a niche for HPC applications.

The third trend arises from Deep Neural Networks (DNNs) for back-propagation learning of complex patterns, which have emerged as a new technique penetrating different application areas. DNN learning requires high performance and is often run on high-performance supercomputers. GPU accelerators are seen as very effective for DNN computing because of enhancements such as support for 16-bit floating point and tensor processing units. It is widely assumed that DNNs will be applied in future autonomous cars, thus opening a very large market segment for embedded HPC. DNNs will also be applied in engineering simulations traditionally running on HPC supercomputers.

Embedded high-performance computing demands are also upcoming needs. They may concern smartphones, but also applications like autonomous driving, which requires on-board high-performance computers. In particular, the trend from current advanced ADAS (automatic driving assistant systems) to piloted driving and on to fully autonomous cars will increase on-board performance requirements, and cars may even be coupled with high-performance supercomputers in the Cloud. The target is to develop systems that adapt more quickly to changing environments, opening the door to highly automated and autonomous transport, capable of eliminating human error in control, guidance and navigation, and so leading to more safety. High-performance computing devices in cyber-physical systems will have to fulfil further non-functional requirements such as timeliness, (very) low energy consumption, security and safety. However, further applications will emerge that may be unknown today or that will receive a much higher importance than expected today.

Power and thermal management is considered highly important and will remain so in the future. Post-Exascale computers will target more than 1 Exaflops with less than 30 MW power consumption (i.e., better than roughly 33 GFLOPS per Watt), requiring processors with a much better performance per Watt than available today. On the other side, embedded computing needs high performance with low energy consumption. The power target at the hardware level is largely the same: a high performance per Watt.

In addition to mastering the technical challenges, reducing the environmental impact of upcoming computing infrastructures is also an important matter. Reducing CO2 emissions and overall power consumption should be pursued. A combination of hardware techniques, such as new processor cores, accelerators, memory and interconnect technologies, and software techniques for energy and power management will need to be cooperatively deployed in order to deliver energy-efficient solutions.

Because of the foreseeable end of CMOS scaling, new technologies are under development, such as, for example, Die Stacking and 3D Chip Technologies, Non-volatile Memory (NVM) Technologies, Photonics, Resistive Computing, Neuromorphic Computing, Quantum Computing, and Nanotubes. Since it is uncertain if and when some of these technologies will mature, it is hard to predict which ones will prevail.

The particular mix of technologies that achieves commercial success will strongly impact the hardware and software architectures of future HPC systems, in particular the processor logic itself, the (deeper) memory hierarchy, and new heterogeneous accelerators.

There is a clear trend towards more complex systems, which is expected to continue over the next decade. These developments will significantly increase software complexity, demanding more and more intelligence across the programming environment, including compiler, run-time and tool intelligence driven by appropriate programming models. Manual optimization of data layout, placement, and caching will become uneconomic and time-consuming, and will, in any case, soon exceed the abilities of the best human programmers.

If accurate results are not strictly needed, another speedup could emerge from more efficient special execution units, based on analog, or even a mix of analog and digital, technologies. Such developments would benefit from more advanced ways to reason about the permissible degree of inaccuracy in calculations at run time. Furthermore, new memory technologies like memristors may allow on-chip integration, enabling tightly-coupled communication between the memory and the processing unit. With the help of in-memory computing algorithms, data could be pre-processed "in-" or "near-" memory.

But it is also possible that new hardware developments will reduce software complexity. New materials could be used to run processors at much higher frequencies than currently possible and, with that, may even enable a significant increase in the performance of single-threaded programs.

Optical networks on die and Terahertz-based connections may eliminate the need for preserving locality, since the access time to local storage may not be as significant in the future as it is today. Such advancements will lead to storage-class memory, which features similar speed, addressability and cost as DRAM, combined with the non-volatility of storage. In the context of HPC, such memory may reduce the cost of checkpointing or eliminate it entirely.

The adoption of neuromorphic, resistive and/or quantum computing as new accelerators may have a dramatic effect on system software and programming models. It is currently unclear whether it will be sufficient to offload tasks, as on GPUs, or whether more dramatic changes will be needed. By 2030, disruptive technologies may have forced the introduction of new and currently unknown abstractions that are very different from today's. Such new programming abstractions may include domain-specific languages that provide greater opportunities for automatic optimization. Automatic optimization requires advanced techniques in the compiler and runtime system. We also need ways to express non-functional properties of software in order to trade off various metrics: performance vs. energy, or accuracy vs. cost, both of which may become more relevant with near-threshold computing, approximate computing, or accelerators.

Nevertheless, today's abstractions will continue to evolve incrementally and will continue to be used well beyond 2030, since scientific codebases have very long lifetimes, on the order of decades.

Execution environments will increase in complexity, requiring more intelligence, e.g., to manage, analyse and debug millions of parallel threads running on heterogeneous hardware with a diversity of accelerators, while dynamically adapting to failures and performance variability. Spotting anomalous behaviour may be viewed as a big data problem, requiring techniques from data mining, clustering and structure detection. This requires an evolution of the incumbent standards such as OpenMP to provide higher-level abstractions. An important question is whether, and to what degree, these fundamental abstractions may be impacted by disruptive technologies.

The Work Needed

As new technologies require major changes across the stack, a vertical funding approach is needed, from applications and software systems through to new hardware architectures and potentially down to the enabling technologies. We see HP Labs' memory-driven computing architecture "The Machine" as an exemplary project that proposes a low-latency NVM (Non-Volatile Memory) based memory connected by photonics to processor cores. Projects could be based on multiple new technologies and similarly explore hardware and software structures and potential applications. The required research will be interdisciplinary. Stakeholders will come from academic and industrial research.

The Opportunity

The opportunity may be the development of competitive new hardware/software technologies based on upcoming new technologies, positioning European industry advantageously for the future. Target areas could be High-Performance Computing and Embedded High-Performance devices. The drawback could be that the chosen base technology does not prevail but is replaced by a different technology. For this reason, efforts should be made to ensure that aspects of the developed hardware architectures, system architectures and software systems can also be applied to alternative prevailing technologies. For instance, several NVM technologies will bring up new memory devices that are several orders of magnitude faster than current Flash technology, and the developed system structures may easily be adapted to the specific prevailing technologies, even if a project has chosen a different NVM technology as its basis.

EC Funding Proposals

The Eurolab4HPC vision recommends the following funding opportunities for topics beyond Horizon 2020 (ICT):
• Convergence of HPC and HPDA:
  – Data Science, Cloud computing and HPC: Big Data meets HPC
  – Inter-operability and integration
  – Limitations of clouds for HPC
  – Edge Computing: local computation for processing near sensors
• Impact of new NVMs:
  – Memory hierarchies based on new NVMs
  – Near- and in-memory processing: pre- and post-processing in (non-volatile) memory
  – HPC system software based on new memory hierarchies
  – Impact on checkpointing and resiliency
• Programmability:
  – Hiding new memory layers and HW accelerators from users by abstractions and intelligent programming environments
  – Monitoring of a trillion threads
  – Algorithm-based fault tolerance techniques within the application, as well as moving the fault-detection burden to the library, e.g. a fault-tolerant message-passing library
• Green ICT and Energy:
  – Integration of cooling and electrical subsystems
  – Supercomputer as a whole system for Green ICT
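The algorithm-based fault tolerance item above has a classic concrete instance: checksum-encoded matrix multiplication in the style of Huang and Abraham, where row and column checksums carried through the computation reveal corrupted result elements. The following is a minimal pure-Python sketch for illustration only; the helper names (`matmul`, `encode`, `check`) are ours, and real ABFT implementations operate at the accelerator or library level:

```python
def matmul(A, B):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def encode(A, B):
    """Append a column-checksum row to A and a row-checksum column to B."""
    Ac = A + [[sum(col) for col in zip(*A)]]   # extra row: column sums of A
    Br = [row + [sum(row)] for row in B]       # extra column: row sums of B
    return Ac, Br

def check(C, tol=1e-9):
    """Verify that the checksum row/column of the full product C still
    match the sums of the data block; any mismatch flags a fault."""
    data = [row[:-1] for row in C[:-1]]        # the actual A x B block
    rows_ok = all(abs(C[-1][j] - sum(r[j] for r in data)) <= tol
                  for j in range(len(data[0])))
    cols_ok = all(abs(row[-1] - sum(row[:-1])) <= tol for row in C[:-1])
    return data, rows_ok and cols_ok
```

The encoded product of an (n+1) x m and an m x (p+1) matrix costs only one extra row and column of arithmetic, yet a single flipped result element breaks both its row and column checksum, so the fault is detected without recomputing the product.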
As remarked above, projects should be interdisciplinary, from applications and software systems through hardware architectures and, where relevant, enabling hardware technologies.
Contents
Executive Summary 3
1 Introduction 9
1.1 Related Initiatives within the European Union . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Working Towards the Revised Eurolab4HPC Vision . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Document Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Technology 19
3.1 Digital Silicon-based Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.1 Continuous CMOS scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.2 Die Stacking and 3D-Chips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Memristor-based Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 Memristor Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 Multi-level-cell (MLC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.3 Memristive Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.4 Neuromorphic and Neuro-Inspired Computing . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Applying Memristor Technology in Reconfigurable Hardware . . . . . . . . . . . . . . . . . . 49
3.4 Non-Silicon-based Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.1 Photonics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.2 Quantum Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4.3 Beyond CMOS Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2.2 Near Memory Computing NMC of COM-N . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2.3 In-Memory Computing (In-Memory Processing, IMP) . . . . . . . . . . . . . . . . . . . 77
4.2.4 Potential and Challenges for In-memory Computing . . . . . . . . . . . . . . . . . . . 78
4.3 New Hardware Accelerators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.4 New Ways of Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.4.1 New Processor Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.4.2 Power is Most Important when Committing to New Technology . . . . . . . . . . . . . 81
4.4.3 Locality of References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4.4 Digital and Analog Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4.5 End of Von Neumann Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4.6 Summary of Potential Long-Term Impacts of Disruptive Technologies for HPC Software
and Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6 Vertical Challenges 95
6.1 Green ICT and Power Usage Effectiveness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2 Resiliency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.3 Impact of Memristive Memories on Security and Privacy . . . . . . . . . . . . . . . . . . . . . 97
6.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3.2 Memristors and Emerging Non-Volatile Memories (NVMs): Security Risks . . . . . . . . 98
6.3.3 Memristors and Emerging NVMs: Supporting Security . . . . . . . . . . . . . . . . . . 99
6.3.4 Memristors, Emerging NVMs and Privacy . . . . . . . . . . . . . . . . . . . . . . . . . 100
1 Introduction
Upcoming application trends and disruptive VLSI technologies will change the way computers will be programmed and used, as well as the way computers will be designed. New application trends such as High-Performance Data Analysis (HPDA) and deep learning will induce changes in High-Performance Computing; disruptive technologies will change the memory hierarchy and hardware accelerators, and may even lead to new ways of computing. The HiPEAC Visions of 2017 and 2019 (www.hipeac.net/publications/vision) see the time to revisit the basic concepts: the US wants to "reboot computing", and the HiPEAC Vision proposes to "re-invent computing" by challenging basic assumptions such as binary coding, interrupts, and the layers of memory, storage and computation.

This document has been funded by the EC CSA Eurolab4HPC-2 project (June 2018 – May 2020), a successor of the EC CSA Eurolab4HPC project (Sept. 2015 – August 2017). It outlines a long-term vision for excellence in European High-Performance Computing research, with a timescale beyond Exascale computers, i.e. a timespan of approximately 2023-2030. It delivers a thorough update of the Eurolab4HPC Vision of 2017 (https://www.eurolab4hpc.eu/vision/). An intermediate step between the Vision of 2017 and the current one of January 2020 was reached by the Memristor Report (https://fb-ti.gi.de/fileadmin/FB/TI/user_upload/Memristor_Report-2019-06-27.pdf), compiled in June 2019 by an expert group of the two German computer science associations "Gesellschaft für Informatik" and "Informationstechnische Gesellschaft".

1.1 Related Initiatives within the European Union

Nowadays the European effort is driven by the EuroHPC Joint Undertaking (https://eurohpc-ju.europa.eu/). The entity started operations in November 2018, with the main objectives of developing a pan-European supercomputing infrastructure and supporting research and innovation activities related to HPC.

The Eurolab4HPC vision complements existing efforts such as the ETP4HPC Strategic Research Agenda (SRA). ETP4HPC is an industry-led initiative to build a globally competitive HPC system value chain. Development of the Eurolab4HPC vision is aligned with the ETP4HPC SRA in its latest version from September 2017. SRA 2017 was targeting a roadmap towards Exascale computers that spans until approximately 2022, whereas the new SRA 2019/2020 is expected to cover 2021-2027, as advanced in the "Blueprint for the new Strategic Research Agenda for High Performance Computing" (https://www.etp4hpc.eu/pujades/files/Blueprint%20document_20190904.pdf) published in April 2019. The Eurolab4HPC visions target the speculative period beyond Exascale, so approximately 2023-2030 and beyond, and from a technology-push point of view.

The Eurolab4HPC vision also complements the PRACE Scientific Case (http://www.prace-ri.eu/third-scientific-case/), which has its main focus on the future of HPC applications for the scientific and industrial communities. Its 3rd edition covers the timeframe 2018-2026. PRACE (Partnership for Advanced Computing in Europe) is the main public European provider of HPC infrastructure for scientific discovery. On the applications side, a set of Centres of Excellence on HPC Applications have been promoted by the European Commission to stimulate the adoption of HPC technologies among a variety of end-user communities. Those are crucial in the co-design of the future disruptive upstream technologies.

The Eurolab4HPC vision is developed in close collaboration with the "HiPEAC Vision" of the HiPEAC CSA, which covers the broader area of "High Performance and Embedded Architecture and Compilation". The Eurolab4HPC vision complements the HiPEAC Vision 2019 document with a stronger focus on disruptive technologies and HPC.
The creation and growth of an HPC ecosystem has been supported by the European Commission and structured by the CSA instrument (https://ec.europa.eu/research/participants/data/ref/h2020/other/wp/2018-2020/annexes/h2020-wp1820-annex-d-csa_en.pdf). Eurolab4HPC represents the most relevant HPC system experts in academia. Most current research and development projects and innovation initiatives are led or participated in by Eurolab4HPC members, who are individuals committed to strengthening the network. EXDCI's main partners are PRACE and ETP4HPC, thus representing main HPC stakeholders such as infrastructure providers and industry. On the HPC application side, FocusCoE is the initiative aiming to support the Centres of Excellence in HPC applications.

[Figure: The European HPC ecosystem — EXDCI links PRACE (HPC infrastructure) and ETP4HPC (HPC industry) via co-design between technology push and industry pull; FocusCoE supports the CoEs in HPC applications.]

2. Assess the potential hardware architectures and their characteristics.

3. Assess what that could mean for the different HPC aspects.

The vision roughly follows the structure: "IF technology ready THEN foreseeable impact on WG topic could be ...".

The second edition of the Vision updates the sections of the first edition and restructures the complete document. The new Vision was again developed by a single expert working group: Sandro Bartolini, Luca Benini, Koen Bertels, Spyros Blanas, Uwe Brinkschulte, Paul Carpenter, Giovanni De Micheli, Marc Duranton, Babak Falsafi, Dietmar Fey, Said Hamdioui, Christian Hochberger, Avi Mendelson, Dominik Meyer, Ilia Polian, Ulrich Rückert, Xavier Salazar, Werner Schindler, and Theo Ungerer. Simon McIntosh–Smith and Igor Zacharov, both contributing

Section 3 focuses on Technologies, i.e. silicon-based (CMOS scaling and 3D-chips), memristor-based, and non-silicon-based (Photonics, Quantum Computing and beyond CMOS) technologies. This section is followed by Section 4, which summarizes the Potential Long-Term Impacts of Disruptive Technologies for HPC Hardware and Software in separate subsections. Section 5 covers System Software and Programming Environment challenges, and finally Section 6 covers Green ICT, Resiliency, and Security and Privacy as Vertical Challenges.
2 Overall View of HPC
The Eurolab4HPC-2 Vision targets particularly tech- USA: The U.S. push towards Exascale is led by the
nology, architecture and software of postExascale HPC DoE’s Exascale Computing Project2 and its extended
computers, i.e. a period of 2023 til 2030. PathForward program landing in the 2021 – 2022
timeframe with “at least one” Exascale system. This
The supercomputers of highest performance are listed roadmap was confirmed in June 2017 with a DoE an-
in the Top 500 Lists1 which is updated twice a year. nouncement that backs six HPC companies as they
The June 2019 shows two IBM-built supercomputers, create the elements for next-generation systems. The
Summit (148.6 petaflops) and Sierra (94.6 petaflops) vendors on this list include Intel, Nvidia, Cray, IBM,
in the first two positions, followed by the Chinese AMD, and Hewlett Packard Enterprise (HPE) [2].
supercomputers Sunway TaihuLight (93.0 petaflops) The Department of Energy’s new supercomputer Au-
and the Tianhe-2A (Milky Way-2A) (61.4 petaflops). rora will be built by Intel and Cray at Argonne, and it
Performance assessments are based on the ‘best’ per- will be the first of its kind in the United States with
formance LINPACK Rmax as measured by the LINPACK costs of estimated 500 million Dollar. It is scheduled
Benchmark. A second list is provided based on the to be fully operational by the end of 2021. Aurora will
High-Performance Conjugate Gradient (HPCG) Bench- also be set up as a perfect platform for deep learn-
mark featuring again Summit and Sierra on top and ing [3]. At the Exascale Day October 21, 2019 it was
the Japanese K computer third. revealed that Aurora shall be based on Next-Gen-Xeon
processors and Intel-Xe-GPUs. Two even faster super-
All these performance data is based on pure petaflops computers with 1.5 exaflops are in the line for 2021
measurements which is deemed to be too restricted and 2022: Frontier based on AMD Next-Gen-Epyc pro-
for useful future supercomputers. Exascale does not cessors with Radeon graphic cards and El Captain with
merely refer to a LINPACK Rmax of 1 exaflops. The up to now unknown hardware [4].
PathForward definition of a capable Exascale system
is focused on scientific problems rather than bench-
China has a good chance of reaching exascale com-
marks, as well as raising the core challenges of power
puting already in 2020. China’s currently fastest listed
consumption and resiliency: “a supercomputer that
supercomputer, the Sunway TaihuLight contains en-
can solve science problems 50X faster (or more com-
tirely Chinese-made processing chips. The Chinese
plex) than on the 20 Petaflop systems (Titan and Se-
government is funding three separate architectural
quoia) of 2016 in a power envelope of 20-30 megawatts,
paths to attain that exascale milestone. This internal
and is sufficiently resilient that user intervention due
competition will pit the National University of De-
to hardware or system faults is on the order of a week
fense Technology (NUDT), the National Research Cen-
on average” [1]. Lastly Exascale computing refers to
ter of Parallel Computer and Sugon (formerly Dawn-
computing systems capable of at least one exaflops,
ing) against one another to come up with the coun-
however, on real applications, not just benchmarks.
try’s (and perhaps the world’s) first exascale super-
Such applications comprise not only traditional super-
computer. Each vendor has developed and deployed a
computer applications, but also neural network learn-
512-node prototype system based on what appears to
ing applications and interconnections with HPDA.
be primarily pre-exascale componentry in 2018 [5].
1 2
https://www.top500.org/lists/ https://www.exascaleproject.org/
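Rmax figures like those quoted above are obtained by timing a dense linear-algebra kernel and dividing its known floating-point operation count by the elapsed wall-clock time. The toy sketch below illustrates only that accounting, not the actual LINPACK benchmark (which solves a dense system Ax = b and counts roughly 2n³/3 + 2n² flops); the function name and the 2n³ count for a plain matrix multiply are our illustrative assumptions:

```python
import time

def measured_gflops(n=96):
    """Time a dense n x n matrix multiply and convert its known
    operation count (2*n^3 flops: n^3 multiply-add pairs) into a
    GFLOP/s rate -- the same accounting behind Rmax-style figures."""
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    t0 = time.perf_counter()
    C = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    elapsed = time.perf_counter() - t0
    return (2.0 * n ** 3) / elapsed / 1e9  # flops / seconds -> GFLOP/s
```

By the same accounting, Summit's 148.6 petaflops means it sustained about 1.486 × 10^17 such floating-point operations per second during the LINPACK solve.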
Japan: The successor to the K supercomputer, which is being developed under the Flagship2020 program, will use ARM-based A64FX processors, and these chips will be at the heart of a new system built by Fujitsu for RIKEN (Japan's Institute of Physical and Chemical Research) that would break the Exaflops barrier [7], aiming to make the system available to its users "around 2021 or 2022" [8].

European Community: The former EC President Juncker has declared that the European Union has to be competitive in the international arena with regard to the USA, China, Japan and other stakeholders, in order to enhance and promote the European industry in the public as well as the private sector related to HPC [9].

The first step will be "Extreme-Scale Demonstrators" (EsDs) that should provide pre-Exascale platforms, deployed by HPC centres and used by Centres of Excellence for their production of new and relevant applications. Such demonstrators are planned by the ETP4HPC Initiative and included in the EC LEIT-ICT 2018 calls.

³ https://www.european-processor-initiative.eu/

2.3 Convergence of HPDA and HPC

2.3.1 Convergence of HPC and Cloud Computing

High-performance computing refers to technologies that enable achieving a high level of computational capacity as compared to a general-purpose computer [13]. High-performance computing in recent decades has been widely adopted for both commercial and research applications, including but not limited to high-frequency trading, genomics, weather prediction and oil exploration. Since the inception of high-performance computing, these applications have primarily relied on simulation as a third paradigm for scientific discovery, together with empirical and theoretical science.

The technological backbone for simulation has been high-performance computing platforms (also known as supercomputers), which are specialized computing instruments to run simulation at maximum speed with lesser regard to cost. Historically these platforms were designed with specialized circuitry and
have trailed behind supercomputers in adopting the latest network fabrics, switches and interface technologies. In contrast, supercomputers have incorporated networks with a higher bisection bandwidth, with the latest high-bandwidth fabrics, interfaces and programmable switches available in the market, irrespective of cost. Because network fabrics are slated to improve by 20% per year in the next decade and beyond, with improvements in optical interconnects, a key differentiator between datacenters and supercomputers is network performance and provisioning.

2.3.5 Cloud-Embedded HPC and Edge Computing

The emergence of data analytics for sciences and warehouse-scale computing will allow much of the HPC that

[3] T. Verge. America's first exascale supercomputer to be built by 2021. url: https://www.theverge.com/2019/3/18/18271328/supercomputer-build-date-exascale-intel-argonne-national-laboratory-energy.
[4] S. Bauduin. Exascale Day: Cray spricht über kommende Supercomputer. url: https://www.computerbase.de/2019-10/cray-exascale-day-supercomputer/.
[5] M. Feldman. China Fleshes Out Exascale Design for Tianhe-3 Supercomputer. url: https://www.nextplatform.com/2019/05/02/china-fleshes-out-exascale-design-for-tianhe-3/.
[6] Z. Zhihao. China to jump supercomputer barrier. 2017. url: http://www.chinadaily.com.cn/china/2017-02/20/content_28259294.htm.
[7] T. P. Morgan. Inside Japan's Future Exascale ARM Supercomputer. 2016. url: https://www.nextplatform.com/2016/06/23/inside-japans-future-exaflops-arm-supercomputer.
[8] M. Feldman. Japan Strikes First in Exascale Supercomputing Battle. url: https://www.nextplatform.com/2019/04/16/japan-strikes-first-in-exascale-supercomputing-battle/.
Technology 19
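The bisection-bandwidth gap between supercomputers and data centres described above can be made concrete with a small sketch. This is our own illustrative model, not taken from the roadmap: `oversub` is the oversubscription factor of the fabric (1 for a fully provisioned, non-blocking network as in many supercomputers; greater than 1 in cost-optimized data-centre fabrics), and the node count and link rate are assumed values.

```python
# Illustrative model (not from the text): worst-case bandwidth across a
# bisection of a fabric with n_nodes endpoints, per-node link rate
# link_gbs (GB/s), and oversubscription factor oversub.

def bisection_bw_gbs(n_nodes: int, link_gbs: float, oversub: float = 1.0) -> float:
    """Worst-case bandwidth (GB/s) across a bisection of the fabric."""
    return n_nodes / 2 * link_gbs / oversub

# 1024 nodes with 100 Gb/s (= 12.5 GB/s) links, both assumed figures:
full = bisection_bw_gbs(1024, 12.5)        # non-blocking supercomputer fabric
cheap = bisection_bw_gbs(1024, 12.5, 4.0)  # 4:1 oversubscribed data-centre fabric
print(full, cheap)  # the oversubscribed fabric provides 1/4 the bisection bandwidth
```

The same node count and link speed thus yield very different guarantees for all-to-all traffic, which is why network provisioning, not raw link speed, differentiates the two platform classes.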
Intel plans 7 nm FinFET for production in early to mid-2020, according to industry sources. Intel's 5 nm production is targeted for early 2023, sources said, meaning its traditional 2-year process cadence is extending to roughly 2.5 to 3 years [6]. WikiChip notes that, for the most part, the foundries' 7 nm processes are competing against Intel's 10 nm process, not its 7 nm [7].

TSMC plans to ship 5 nm in 2020, which is also expected to be a FinFET. In reality, though, TSMC's 5 nm will likely be equivalent in terms of specs to Intel's 7 nm, analysts said [6].

TSMC started production of its 7 nm HK-MG FinFET process in 2017 and is already actively developing its 5 nm process technology as well. Furthermore, TSMC is also developing 3 nm process technology. Although 3 nm process technology still seems far away, TSMC is further looking to collaborate with academics to begin developing 2 nm process technology [8].

Samsung's newest foundry process technologies and solutions introduced at the annual Samsung Foundry Forum include 8 nm, 7 nm, 6 nm, 5 nm and 4 nm in its newest process technology roadmap [9]. However, no time scale is provided.

Samsung will use EUVL for its 7 nm node and will thus be the first to introduce this new technology after more than a decade of development. On May 24, 2017, Samsung released a press release with its updated roadmap. Due to delays in the introduction of EUVL, Samsung will introduce a new process called 8 nm LPP to bridge the gap between 10 nm and 7 nm [7].

GlobalFoundries: As of August 2018, GlobalFoundries has announced it will suspend further development of its 7 nm, 5 nm and 3 nm processes.

Expected to be introduced by the foundries TSMC and Samsung in 2020 (and by Intel in 2023), the 5-nanometer process technology is characterized by its use of FinFET transistors with fin pitches in the 20s of nanometers and densest metal pitches in the 30s of nanometers. Due to the small feature sizes, these processes make extensive use of EUV for the critical dimensions [10].

Not much is known about the 3 nm technology. Commercial integrated circuit manufacturing using a 3 nm process is set to begin some time around 2023 [11].

Research Perspective

"It is difficult to shed a tear for Moore's Law when there are so many interesting architectural distractions on the systems horizon" [12]. However, silicon technology scaling will still continue, and research in silicon-based hardware is still prevailing, in particular targeting specialized and heterogeneous processor structures and hardware accelerators.

However, each successive process shrink becomes more expensive, and therefore each wafer will be more expensive to deliver. One trend to improve the density on chips will be 3D integration, also of logic. Hardware structures that mix silicon-based logic with new NVM technology are upcoming and intensely investigated. A revolutionary DRAM/SRAM replacement will be needed [1].

As a result, non-silicon extensions of CMOS, using III–V materials or carbon nanotubes/nanowires, as well as non-CMOS platforms, including molecular electronics, spin-based computing, and single-electron devices, have been proposed [1].

For a higher integration density, new materials and processes will be necessary. Since there is a lack of knowledge about the fabrication processes of such new materials, the reliability might be lower, which may result in the need for integrated fault-tolerance mechanisms [1].

Research in CMOS process downscaling and building fabs is driven by industry, not by academic research. Availability of such CMOS chips will be a matter of cost and not only of availability of technology.

References

[1] Semiconductor Industry Association. ITRS 2.0, Executive Summary, 2015 Edition. url: http://cpmt.ieee.org/images/files/Roadmap/ITRSHetComp2015.pdf.
[2] Semiconductor Industry Association. ITRS 2.0, 2015 Edition, Executive Report on DRAM. url: https://www.semiconductors.org/clientuploads/Research_Technology/ITRS/2015/0_2015%20ITRS%202.0%20Executive%20Report%20(1).pdf.
[3] HiPEAC Vision 2015. url: www.hipeac.net/v15.
[4] WikiChip. 10 nm lithography process. 2019. url: https://en.wikichip.org/wiki/10_nm_lithography_process.
[5] A. Shilov. Samsung and TSMC Roadmaps: 8 and 6 nm Added, Looking at 22ULP and 12FFC. url: http://www.anandtech.com/show/11337/samsung-and-tsmc-roadmaps-12-nm-8-nm-and-6-nm-added.
[6] M. Lapedus. Uncertainty Grows For 5 nm, 3 nm. url: https://semiengineering.com/uncertainty-grows-for-5nm-3nm/.
[7] WikiChip. 7 nm lithography process. 2019. url: https://en.wikichip.org/wiki/7_nm_lithography_process.
[8] S. Chen. TSMC Already Working on 5 nm, 3 nm, and Planning 2 nm Process Nodes. url: https://www.custompcreview.com/news/tsmc-already-working-5nm-3nm-planning-2nm-process-nodes/33426/.
[9] Samsung. Samsung Set to Lead the Future of Foundry with Comprehensive Process Roadmap Down to 4 nm. url: https://news.samsung.com/global/samsung-set-to-lead-the-future-of-foundry-with-comprehensive-process-roadmap-down-to-4nm.
[10] WikiChip. 5 nm lithography process. 2019. url: https://en.wikichip.org/wiki/5_nm_lithography_process.
[11] WikiChip. 3 nm lithography process. 2019. url: https://en.wikichip.org/wiki/3_nm_lithography_process.
[12] N. Hemsoth. Neuromorphic, Quantum, Supercomputing Mesh for Deep Learning. url: https://www.nextplatform.com/2017/03/29/neuromorphic-quantum-supercomputing-mesh-deep-learning/.

3.1.2 Die Stacking and 3D-Chips

Die stacking and three-dimensional chip integration denote the concept of stacking integrated circuits (e.g. processors and memories) vertically in multiple layers. 3D packaging assembles vertically stacked dies in a package, e.g., system-in-package (SIP) and package-on-package (POP).

Die stacking can be achieved by connecting separately manufactured wafers or dies vertically, either wafer-to-wafer, die-to-wafer, or even die-to-die. The mechanical and electrical contacts are realized either by wire bonding, as in SIP and POP devices, or by microbumps. SIP is sometimes listed as a 3D stacking technology, although it is more precisely denoted as 2.5D technology.

An evolution of the SIP approach which is now extremely strategic for HPC systems consists of stacking multiple dies (called chiplets) on a large interposer that provides connectivity among chiplets and to the package. The interposer can be passive or active. A passive interposer, implemented with silicon or with an organic material to reduce cost, provides multiple levels of metal interconnects and vertical vias for inter-chiplet connectivity and for redistribution of connections to the package. It also provides micropads for the connection of the chiplets on top. Active silicon interposers offer the additional possibility to include logic and circuits in the interposer itself. This more advanced and high-cost integration approach is much more flexible than passive interposers, but it is also much more challenging for design, manufacturing, test and thermal management. Hence it is not yet widespread in commercial products.

The advantages of 2.5D technology based on chiplets and interposers are numerous. First, short communication distances between dies and a finer pitch for wires in the interposer (compared to traditional PCBs) reduce communication load and thereby communication power consumption. Second, dies from various heterogeneous technologies, like DRAM and non-volatile memories, or even photonic devices, can be assembled on the same interposer, in order to benefit from the best technology where it fits best. Third, system yield and cost are improved by partitioning the system in a divide-and-conquer approach: multiple dies are fabricated, tested and sorted before the final 3D assembly, instead of fabricating ultra-large dies with much reduced yield. The main challenges for 2.5D technology are manufacturing cost (setup and yield optimization) and thermal management, since cooling high-performance parts requires complex packages, thermal coupling materials and heat spreaders, and chiplets may have different thermal densities (e.g. logic dies typically have much higher heat dissipation per unit area than memories). Passive silicon interposers connecting chiplets in heterogeneous technologies are now mainstream in HPC products: AMD EPYC processors, integrating 7 nm and 14 nm logic chiplets, and NVIDIA TESLA GPGPUs, integrating logic and DRAM chiplets (high-bandwidth memory, HBM, interface), are the most notable examples.

True 3D integration, where silicon dies are vertically stacked on top of each other, is most advanced and commercially available in memory chips. The leading technology for 3D integration is based on Through-Silicon Vias (TSVs), and it is widely deployed in DRAMs. In fact, the DRAMs used in high-bandwidth memory (HBM) chiplets are made of multiple stacked DRAM dies connected by TSVs. Hence, TSV-based 3D technology and interposer-based 2.5D technology are indeed combined when assembling an HBM multi-chiplet system.

However, TSVs are not the densest 3D connectivity option. 3D-integrated circuits can also be achieved by stacking active layers vertically on a single wafer in a monolithic approach. This kind of 3D chip integration does not use micro-pads or Through-Silicon
Vias (TSVs) for communication, but it uses vertical interconnects between layers, with a much finer pitch than that allowed by TSVs. The main challenge in monolithic integration is to ensure that elementary devices (transistors) have a similar quality level and performance in all the silicon layers. This is a very challenging goal, since the manufacturing process is not identical for all the layers (low-temperature processes are needed for the layers grown on top of the bulk layer). However, monolithic 3D systems are currently in volume production, even though for very specialized structures, namely 3D NAND flash memories. These memories have allowed flash technology to scale in density beyond the limits of 2D integration, and they are now following a very aggressive roadmap towards hundreds of layers.

While TSV-based and monolithic 3D technologies are already mature and in production for memories, they are still in the prototyping stage for logic, due to a number of technical challenges linked to the requirement of faster transistors, the extremely irregular connections and the much higher heat density that characterize logic processes.

Some advanced solutions for vertical die-to-die communication do not require ohmic contact in metal, i.e. capacitive and inductive coupling as well as short-range RF communication solutions that do not require a flow of electrons passing through a continuous metal connection. These approaches are usable both in die-stacked and monolithic flavors, but the transceivers and modulator/demodulator circuits do take space, and vertical connectivity density is currently not better than that of TSVs, though it could scale better in multi-layer stacks. These three-dimensional die-to-die connectivity options are not currently available in commercial devices, but their usability and cost are under active exploration.

The latest HBM version is based on the HBM2E spec, which has 8/16 GB capacities. It has 1,024 I/Os with 3.2 Gbps transfer rates, achieving 410 GB/s of bandwidth. HBM2E is sampling and is expected to reach the market in 2020. The next version, HBM3, has 4 Gbps transfer rates with 512 GB/s bandwidth, and it is planned for 2020/21. HBM is also very common in specialized machine learning accelerators, such as Habana Labs' (recently bought by Intel) Gaudi AI Training Processor. The Gaudi processor includes 32 GB of HBM2 memory.

HBM stacks DRAM dies on top of each other and connects them with TSVs. For example, Samsung's HBM2 technology consists of eight 8 Gbit DRAM dies, which are stacked and connected using 5,000 TSVs. The bandwidth advantage of HBM with respect to standard DDR memories is staggering: HBM2 enables 307 GB/s of data bandwidth, compared to 85.2 GB/s with four DDR4 DIMMs. Recently, Samsung introduced a new HBM version that stacks 12 DRAM dies, which are connected using 60,000 TSVs, while the package thickness remains similar to that of the 8-die stack version. This HBM flavor targets data-intensive applications, like AI and HPC. It achieves 24 gigabytes of capacity, a 3x improvement over the prior generation.

HBM is probably the most advanced and well-defined 2.5D interface standard used today in HPC across multiple vendors. However, 2.5D chiplet integration technology is also heavily used by AMD in their Zen 2 EPYC server processors (codenamed Rome) to integrate up to 64 cores within a 5-chiplet package with silicon interposer. 2.5D approaches are also heavily used in high-end FPGAs from both Intel (Altera) and Xilinx, to integrate in the same package multiple FPGA dies as well as HBM memories. Both CPUs and FPGAs use proprietary protocols and interfaces for their inter-chiplet connectivity, as opposed to the memory-chiplet connectivity in HBM, which is standardized by JEDEC.
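As a quick sanity check on the HBM figures quoted above, peak bandwidth follows directly from the I/O count and the per-pin rate. A minimal sketch of that arithmetic (the DDR4-2666 per-pin rate and 64-bit data width in the four-DIMM comparison are our assumptions, chosen to reproduce the 85.2 GB/s figure):

```python
# Back-of-envelope check of the bandwidth figures in the text:
# peak bandwidth = number of I/Os x per-pin rate (Gb/s) / 8 bits per byte.

def peak_bw_gbs(n_ios: int, gbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s for a DRAM interface."""
    return n_ios * gbps_per_pin / 8

hbm2e = peak_bw_gbs(1024, 3.2)    # 1,024 I/Os at 3.2 Gbps
print(hbm2e)                      # ~410 GB/s, matching the HBM2E figure above

# Four DDR4-2666 DIMMs, each with a 64-bit data bus at 2.666 Gbps per pin:
ddr4 = 4 * peak_bw_gbs(64, 2.666)
print(round(ddr4, 1))             # ~85.3 GB/s, close to the 85.2 GB/s quoted
```

The roughly 5x gap comes almost entirely from the I/O count: HBM's 1,024-bit stacked interface is only feasible because the TSV/interposer wiring described in this section is far denser than PCB traces.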
3D NAND flash memories are also very heavily used in mobile phones, as they allow very small form factors. Current products have up to 128 3D NAND flash cell layers, although volume shipments are for 96 layers or less. By 2020, 128-layer 3D NAND products will be in volume production, with 192-layer 3D NAND probably sampling. By 2022, 3D NAND flash with over 200 layers will probably be available. However, the manufacturing cost grows with the number of flash layers, so the number of layers does not translate linearly into a storage capacity cost reduction. For this reason, NAND flash companies are also pushing multi-bit-per-cell devices (three and four bits per cell) for enterprise and client applications.

Flash memories are not the only non-volatile memories to follow a 3D-integration roadmap. Intel and Micron announced "3D XPoint" memory already in 2015 (assumed to offer 10x the capacity of DRAM and to be 1000x faster than NAND flash [2]). Intel/Micron 3D XPoint memory has been commercially available as an Optane SSD (DC P4800X, 375 GB) since March 2017, stated to be 2.5 to 77 times "better" than NAND SSDs. Even though the Optane product line has encountered several roadmap issues (technical and business-related), it is now actively marketed by Intel. Optane products are also developed as persistent-memory modules, which can only be used with Intel's Cascade Lake Xeon CPUs and are available in 128 GB, 256 GB or 512 GB capacities. A second-generation Optane persistent-memory module, code-named Barlow Pass, and a second-generation Optane SSD, code-named Alder Stream, are planned for release by the end of 2020.

For what concerns 3D logic technology, prototypes date back to 2004, when Tezzaron released a 3D IC microcontroller [3]. Intel evaluated chip stacking for a Pentium 4 already in 2006 [4]. Early multicore designs using Tezzaron's technology include the 64-core 3D-MAPS (3D Massively Parallel processor with Stacked memory) research prototype from 2012 [5] and the Centip3De with 64 ARM Cortex-M3 cores, also from 2012 [6]. Fabs are able to handle 3D packages (e.g. [7]). In 2011, IBM announced a 3D chip production process [8]. 3D Networks-on-Chips for connecting stacks of logic dies were demonstrated in 2011 [9]. All these early prototypes, based on TSV approaches, have not reached product maturity. More recent research on 3D logic is focusing on monolithic integration, where multiple layers of active devices are fabricated together using multiple lithographic steps on the same silicon die, and on reducing the pitch of TSVs by new wafer-scale assembly processes [10].

Also the field of contactless connectivity for 3D integration is in an exploratory phase, with a number of different options being considered, namely capacitive coupling, inductive coupling and short-range RF. These alternative approaches do not require ohmic connections between dies, and hence are potentially more flexible in terms of the interconnection topologies implementable in the 3D stack. However, their maturity level is lower than that of TSVs, and their cost and density need to be optimized for production [11].

Perspective

2.5D is now the go-to technology for HPC. All major silicon vendors in the HPC space (Intel, AMD, NVIDIA) have solid roadmaps based on 2.5D approaches, namely HBM and chiplet integration. In general, we will certainly see increased use of 2.5D chiplet technology, not only to tackle the memory bandwidth bottleneck (the primary goal of HBM), but also to improve yield by integrating multiple smaller chips on a large interposer. In an alternative view, chiplets can also be used to increase the effective die size to 1000 mm², which is larger than the reticle size, by using a common interconnect and putting many homogeneous chiplets on a substrate to build a huge 2.5D-integrated multi-chiplet "mega-chip". It is important to note that 2.5D technology is very flexible: it can be used to connect chips developed at different process nodes, depending on what makes the most sense for a particular function. This is done today in modern CPUs from AMD, which use different logic technologies in their chiplet-based processors (7 nm for processor chiplets and 14 nm for an IO-central-hub chiplet).

We expect to see many more variations of 2.5D integration in the next 5-10 years, with the densest logic technology (5 nm will potentially be introduced in 2020/21) used for compute-intensive chiplets and differentiated, possibly less scaled technology for different functions, such as storage, IO and accelerators. It is quite easy to envision that in the 2020 decade 2.5D technology will be essential for maintaining the pace of performance and energy-efficiency evolution and for compensating for the slowdown of Moore's law. A key enabler for 2.5D technology development is the definition of standard protocols for chiplet-to-chiplet communication. Currently there are several proprietary protocols, but an evolution toward a multi-vendor standard is highly desirable: a candidate is Bunch of Wires (BoW), the new chiplet interface proposed by
the OCP ODSA group, designed to address the interface void for organic substrates [12]. Significant innovation will come from interposer technology: organic substrates are aggressively developed to compete with silicon in terms of connection density and bandwidth density for passive interposers. Silicon interposers are moving toward active solutions, such as the Foveros approach developed by Intel, with products announced for late 2020.

3D-stacked memories using TSVs (HBM DRAMs) and monolithic integration (3D NAND flash) are now mainstream in HPC and they are here to stay, with solid and aggressive roadmaps. Various alternative non-volatile memory technologies also rely heavily on 3D integration. The XPoint technology, based on 3D (multi-layer) phase-change memory (PCM), has already reached volume production and is available as a niche option in HPC. Other technologies are actively explored, such as magnetic random-access memory (MRAM), ferroelectric RAM (FRAM) and resistive RAM (RRAM). These memories will have a hard time competing in cost as solid-state storage options against 3D flash and the still cost-competitive traditional hard-disk drives. On the other hand, an area of growth for these new non-volatile memories is in the memory hierarchy, as a complement or replacement for DRAM main memory. Optane DRAM-replacing DIMMs in particular are intended for use with Intel's advanced server processors, and Intel is using this technology to differentiate from competitors for the next generation of server CPUs. Other HPC manufacturers, such as Cray/HPE, are using Optane memory in their storage systems as an intermediate storage element in an effort to reduce DRAM, as a write cache in front of NAND flash to achieve higher endurance, and in other applications. This is because Optane memory sells at a per-capacity price between NAND flash and DRAM.

We expect non-flash 3D NV memory technologies to start competing in the HPC space in the next five years, as several semiconductor foundries are offering MRAM (as well as RRAM) as options for embedded memory to replace NOR, higher-level (slower) SRAM and some DRAM. Some embedded products using MRAM for inference-engine weight memory applications have appeared in 2019. Probably the most short-term entry for MRAM and RRAM technology is as embedded memories on logic SoCs, but coupling these memories in multi-layer monolithic 3D configurations, possibly as chiplets in 2.5D integrated systems, as done today for HBM DRAM, opens exciting innovation and differentiation perspectives for HPC.

All 3D solid-state memory applications will benefit from developments of interface technologies that allow utilizing their inherently higher performance with respect to HDDs and traditional flash, especially for write operations. In particular, the NVMe protocol, based upon the PCIe bus, and the use of this protocol over various storage fabric technologies (NVMe over Fabrics, or NVMe-oF), combined with software and firmware, are becoming key enablers in the development of the modern storage and memory hierarchy.

For what concerns the roadmap of three-dimensional integration for logic processes (including SRAM memories), future perspectives are blurrier. To the best of our knowledge, there are no volume commercial products using logic die stacking for high-performance computing (or computing in general), and no product announcements have been made by major players. This is mainly due to the lack of a compelling value proposition. Current production-ready TSV-based 3D integration technology does not offer enough vertical connectivity density and bandwidth density to achieve a performance boost that would justify the cost and risk of achieving production-quality 3D stacks of logic dies. Similarly, monolithic 3D technologies have not yet been able to demonstrate sufficiently high added value, due to the performance deterioration of transistors implemented within higher layers of the chip stack.

This situation is probably going to change in the next five years, as scaled transistors (sub-5 nm) are moving toward true three-dimensional structures, such as the "gate all around" devices demonstrated at large scale of integration [13]. These devices offer disruptive options for integration in the vertical dimension, creating new avenues to implement even truly monolithic three-dimensional elementary gates. Additional options are offered by "buried layer" metallization: for instance, new high-density SRAM cells can be envisioned in advanced nodes exploiting buried Vdd distribution [14].

A key challenge in establishing full-3D logic chip stacking technology is gaining control of the thermal problems that have to be overcome to reliably realize very dense 3D stacks working at high frequency. This requires the availability of appropriate design tools which explicitly support 3D layouts. Both topics represent an important avenue for research in the next 10 years.
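To see why thermal management dominates full-3D logic stacking, a rough estimate helps: if every stacked die dissipates power but heat can leave essentially through one cooled face, the heat flux through that face grows linearly with the number of active layers. The per-die power and area below are our own illustrative assumptions, not figures from this roadmap.

```python
# Illustrative sketch (assumed numbers): heat flux through the single
# cooled surface of an n-die logic stack, assuming all dies dissipate
# equally and all heat exits through one face.

def stack_heat_flux(n_dies: int, watts_per_die: float, die_area_cm2: float) -> float:
    """Heat flux (W/cm^2) that the package must extract from the stack."""
    return n_dies * watts_per_die / die_area_cm2

for n in (1, 2, 4):
    # e.g. 100 W logic dies of 1.2 cm^2 (assumption): flux grows linearly with n
    print(n, stack_heat_flux(n, 100.0, 1.2))
```

Even a two-die stack roughly doubles the flux a conventional package must extract, which is why the text above ties full-3D logic stacking to new packages, thermal-interface materials and 3D-aware design tools.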
Impact on Hardware

Full 3D stacking has multiple potential beneficial impacts on hardware in general and on the design of future processor-memory architectures in particular. Wafers can be partitioned into smaller dies, because comparatively long horizontally running links are relocated to the third dimension, thus enabling smaller form factors, as done today for 3D memories. 3D stacking also enables heterogeneity, by integrating layers manufactured in different processes, e.g., different memory technologies, like SRAM, DRAM, Spin-Transfer Torque RAM (STT-RAM) and also memristor technologies. Due to short connection wires, a reduction of power consumption is to be expected. Simultaneously, a high communication bandwidth between layers can be expected, leading to particularly high processor-to-memory bandwidth if memories can be monolithically integrated with logic gates.

The last-level caches will probably be the first to be affected by 3D stacking technologies when they enter logic processes. 3D caches will increase bandwidth and reduce latencies through a large cache memory stacked on top of logic circuitry. A consequent further step is to expand 3D chip integration also to main memory, in order to contribute decisively to reducing the current memory wall, which is one of the strongest obstructions to getting more performance out of HPC systems. Furthermore, possibly between 2025 and 2030, local memories and some arithmetic units will undergo the same changes, ending up in complete 3D many-core microprocessors, which are optimized in power consumption due to reduced wire lengths and denser 3D cells. In-memory 3D processing, where computation and storage are integrated in the vertical dimension at a very fine pitch, is another promising long-term direction, with early adoption in products for specialized computation (e.g. neural networks [15]): several startups are active in this area and have announced products (e.g. Crossbar Inc.), and some large companies (e.g. IBM) have substantial R&D efforts in this field.

It is highly probable that 2.5D (chiplets) and full 3D (monolithic integration) will continue to coexist and partially merge in the 2020 decade. Most ICs will consist of multiple chiplets integrated on interposers (possibly active ones), and chiplets themselves will be true 3D integrated stacks based on high-density (micrometer-pitch) TSV connections as well as truly monolithic ultra-high-density (nanometer-pitch) vertical devices and wires. Niche applications may be covered by non-ohmic 3D connections.

A collateral but very interesting trend is the 3D stacking of sensors. Sony is the market leader in imaging sensors, and it extensively uses 3D stacking technology to combine image sensors directly with column-parallel analogue-digital converters and logic circuits [16, 17]. This trend will open the opportunity for fabricating fully integrated systems that also include sensors and their analogue-to-digital interfaces. While this integration trend won't directly impact the traditional market for HPC chips, it will probably gain traction in many high-growth areas for embedded HPC.

Funding Perspectives

It is now clear that more and more hardware devices will use 3D technology, and virtually all HPC machines in the future will use chips featuring some form of three-dimensional integration. Hence, circuit-level and system-level design will need to increasingly become 3D-aware. Moreover, some flavors of three-dimensional IC technology are now being commoditized, with foundries offering 2.5D integration options even for startups and R&D projects. As a consequence, three-dimensional technology won't be accessible as internal technology only to multi-billion-dollar industry players. Given the already demonstrated impact and the rapidly improving accessibility and cost, the EU definitely needs to invest in research on how to develop components and systems based on 3D technology. It is also clear that the development of 3D technology is getting increasingly strategic, and hence significant R&D investments are needed in this capital-intensive area for Europe to remain competitive in HPC.

References

[1] NVIDIA. NVIDIA Tesla P100 Whitepaper. 2016. url: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf.
[2] Intel. Intel® Optane™: Supersonic memory revolution to take-off in 2016. 2016. url: http://www.intel.eu/content/www/eu/en/it-managers/non-volatile-memory-idf.html.
[3] Tezzaron Semiconductor. Tezzaron 3D-IC Microcontroller Prototype. 2016. url: http://www.tachyonsemi.com/OtherICs/3D-IC%5C_8051%5C_prototype.htm.
[4] B. Black et al. “Die Stacking (3D) Microarchitecture”. In:
Proceedings of the 39th Annual IEEE/ACM International Sympo-
sium on Microarchitecture. MICRO 39. Washington, DC, USA,
2006, pp. 469–479. doi: 10.1109/MICRO.2006.18. url:
https://doi.org/10.1109/MICRO.2006.18.
[5] D. H. Kim et al. “3D-MAPS: 3D Massively parallel processor
with stacked memory”. In: 2012 IEEE International Solid-State
Circuits Conference. Feb. 2012, pp. 188–190. doi: 10.1109/
ISSCC.2012.6176969.
[6] D. Fick et al. “Centip3De: A 3930DMIPS/W configurable
near-threshold 3D stacked system with 64 ARM Cortex-M3
cores”. In: 2012 IEEE International Solid-State Circuits Confer-
ence. Feb. 2012, pp. 190–192. doi: 10.1109/ISSCC.2012.
6176970.
[7] 3D & Stacked-Die Packaging Technology Solutions. 2016. url: http://www.amkor.com/go/3D-Stacked-Die-Packaging.
[8] IBM. IBM setzt erstmals 3D-Chip-Fertigungsverfahren ein. 2016. url: http://www-03.ibm.com/press/de/de/pressrelease/36129.wss.
[9] G. V. der Plas et al. “Design Issues and Considerations for
Low-Cost 3-D TSV IC Technology”. In: IEEE Journal of Solid-
State Circuits 46.1 (Jan. 2011), pp. 293–307.
[10] S. V. Huylenbroeck et al. "A Highly Reliable 1.4 µm Pitch Via-Last TSV Module for Wafer-to-Wafer Hybrid Bonded 3D-SOC Systems". In: IEEE 69th Electronic Components and Technology Conference (ECTC). 2019, pp. 1035–1040.
[11] I. A. Papistas and D. Velenis. "Fabrication Cost Analysis for Contactless 3-D ICs". In: IEEE Transactions on Circuits and Systems II: Express Briefs 66.5 (May 2019), pp. 758–762.
[12] OCP ODSA. Bunch of Wires (BoW) interface. url: https://github.com/opencomputeproject/ODSA-BoW.
[13] A. Veloso et al. “Vertical Nanowire and Nanosheet FETs:
Device Features, Novel Schemes for improved Process Con-
trol and Enhanced Mobility, Potential for Faster & More
Energy-Efficient Circuits”. In: IEDM (2019).
[14] S. M. Salahuddin et al. “SRAM With Buried Power Distri-
bution to Improve Write Margin and Performance in Ad-
vanced Technology Nodes”. In: IEEE Electron Device Letters
40.8 (Aug. 2019), pp. 1261–1264.
[15] M. Gao et al. “TETRIS: Scalable and Efficient Neural Net-
work Acceleration with 3D Memory”. In: International Con-
ference on Architectural Support for Programming Languages
and Operating Systems. Apr. 2017, pp. 751–764.
[16] T. Kondo et al. “A 3D stacked CMOS image sensor with
16Mpixel global-shutter mode and 2Mpixel 10000fps mode
using 4 million interconnections”. In: 2015 Symposium on
VLSI Circuits (VLSI Circuits). June 2015, pp. C90–C91. doi:
10.1109/VLSIC.2015.7231335.
[17] R. Muazin. Sony Stacked Sensor Presentation (ISSCC 2013). 2013.
url: http://image- sensors- world- blog.blogspot.
de / 2013 / 02 / isscc - 2013 - sony - stacked - sensor .
html.
26 Technology
3.2 Memristor-based Technology

Memristor Defined by Leon Chua's System Theory

A memristor itself is a special case of a memristive system with only one state variable, x. Such a memristive system is either current-controlled (3.3), in which case the internal state variable is the charge, q, controlled by a current, I, and the output is a voltage, V; or it is voltage-controlled (3.4), in which case the state variable is the flux, φ, controlled by the voltage, V, and the output of the system is the current, I.

The most important memristive memory technologies are:

• PCM (Phase Change Memory), which switches a crystalline material, e.g. chalcogenide glass, between amorphous and crystalline states by heat produced by the passage of an electric current through a heating element,

• ReRAM (Resistive RAM), with the two sub-classes

  – CBRAM (Conductive Bridge RAM), which generates low-resistance filament structures between two metal electrodes by ion exchange,

  – OxRAM (Metal Oxide Resistive RAM), which consists of a bi-layer oxide structure, namely a metal-rich layer with lower resistivity (base layer) and an oxidised layer with higher resistivity. The ratio of the heights of these two layers, and thereby the resistance of the whole structure, can be changed by redistribution of oxygen vacancies,

  – DioxRAM (Diode Metal Oxide Resistive RAM), in which oxygen vacancies are redistributed and trapped close to one of the two metal electrodes, lowering the barrier height of the corresponding metal electrode,

• MRAM (Magnetoresistive RAM), which stores data in magnetic tunnel junctions (MTJs), components consisting of two ferromagnets separated by a thin insulator,

• STT-RAM (Spin-Transfer Torque RAM), a newer technology that uses spin-aligned ("polarized") electrons to directly torque the domains, and

• NRAM (Nano RAM), based on carbon-nanotube technology.

The functioning of these technologies is now described in more detail.

Figure 3.2: PCM cell structure [5] (schematic: insulator, heater and bottom electrode; (a) high-resistance state, (b) low-resistance state)

In this structure a phase-change material layer is sandwiched between two electrodes. When current passes through the heater, it induces heat in the phase-change layer and thereby elicits the structure change. To read the data stored in the cell, a low-amplitude read pulse is applied that is too small to induce a phase change. By applying such a low read voltage and measuring the current across the cell, its resistance, and hence the stored binary value, can be read out. To program the PCM cell into the high-resistance state, the temperature of the cell has to be raised above the melting temperature of the material, while to program the cell into the low-resistance state the temperature must be held well above the crystallization temperature and below the melting temperature for a duration sufficient for crystallization to take place [5].

PCM [6, 7, 8, 9, 10] can be integrated into the CMOS process, and its read/write latency is only tens of nanoseconds slower than that of DRAM, whose latency is
roughly around 100 ns. The write endurance is a hundred million, or up to hundreds of millions, of writes per cell at current processes. The resistivity of the memory element in PCM is more stable than that of Flash; at the normal working temperature of 85 °C, it is projected to retain data for 300 years. Moreover, PCM exhibits higher resistance to radiation than Flash memory. PCM is currently positioned mainly as a Flash replacement.

Figure 3.3: Scheme for OxRAM and CBRAM based memristive ReRAM devices (memristor symbol; Pt/TiO2-x/TiO2/Pt stack in low- and high-resistance states; Cu/SiO2/Pt cell with a Cu-ion filament)

ReRAM, also called RRAM, offers a simple cell structure which enables reduced processing costs. Fig. 3.3 shows the technological scheme for ReRAM devices based on OxRAM or CBRAM. The different non-volatile resistance values are stored as follows.

In CBRAM [11], metal is used to construct the filaments: by applying a voltage to the top copper electrode, Cu+ ions move from the top electrode to the bottom negative electrode made of platinum. As a result, the positively charged copper ions reoxidize with electrons, and a copper filament grows that offers a lower resistance. By applying an opposite voltage, this filament is removed, and the increasing gap between the tip of the filament and the top electrode results in a higher resistance.

In an OxRAM-based ReRAM [12, 13], oxygen ionization is exploited for the construction of layers with oxygen vacancies, which have a lower resistivity. The thickness ratio in a bi-layer oxide structure between the resistance-switching layer with higher resistivity, e.g. TiO2-x, and the base metal-rich layer with lower resistivity, e.g. TiO2 (see Fig. 3.3), is changed by redistribution of oxygen vacancies.

In bipolar OxRAM-based ReRAMs (DioxRAM), i.e. devices in which both electrodes can be connected to arbitrary voltages, oxygen vacancies are redistributed and trapped, e.g. by Ti ions in BiFeO3 [14], close to one of the two metal electrodes. The accumulation of oxygen vacancies lowers the barrier height of the corresponding metal electrode [15]. If both metal electrodes have a reconfigurable barrier height, the DioxRAM works as a complementary resistance switch [16]. The resistance of the DioxRAM depends on the amplitude of the writing bias and can be controlled in a fine-tuned analog manner [17]. Local ion irradiation improves the resistive switching at the normal working temperature of 85 °C [18].

The endurance of ReRAM devices can be more than 50 million cycles, and the switching energy is very low [19]. ReRAM can deliver a 100x lower read latency at relatively low energy and with high speed, featuring read/write latencies close to DRAM.

MRAM is a memory technology that uses the magnetism of electron spin to provide non-volatility without wear-out. MRAM stores information in magnetic material integrated with silicon circuitry to deliver the speed of SRAM with the non-volatility of Flash in a single unlimited-endurance device. Current MRAM technology from Everspin features a symmetric read/write access of 35 ns, data retention of more than 20 years, unlimited endurance, and a reliability that exceeds a 20-year lifetime at 125 °C. It can easily be integrated with CMOS [21].

MRAM requires only slightly more power to write than to read, and no change in the voltage, eliminating the need for a charge pump. This leads to much faster operation and lower power consumption than Flash. Although MRAM is not quite as fast as SRAM, it is close enough to be interesting even in this role. Given its much higher density, a CPU designer may be inclined to use MRAM to offer a much larger but somewhat slower cache, rather than a smaller but faster one [22].

STT (spin-transfer torque, or spin-transfer switching) is a newer MRAM technique based on spintronics, i.e. the technology of manipulating the spin state of electrons. STT uses spin-aligned ("polarized") electrons to directly torque the domains. Specifically, if the electrons flowing into a layer have to change their spin, this develops a torque that is transferred to the nearby layer. This lowers the amount of current needed to write the cells, making it about the same as the read process.
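The current-controlled and voltage-controlled memristive systems described at the beginning of this section can be written compactly. The following is a reconstruction following Chua's standard formulation [3], not a quotation of the original equations (3.3) and (3.4); R(q) denotes the charge-dependent memristance and G(φ) the flux-dependent memductance, with the symbols used in the text:

```latex
% Current-controlled memristive one-port (cf. Eq. (3.3)):
% state variable x = q (charge), input current I, output voltage V
V = R(q)\,I, \qquad \frac{\mathrm{d}q}{\mathrm{d}t} = I

% Voltage-controlled memristive one-port (cf. Eq. (3.4)):
% state variable x = \varphi (flux), input voltage V, output current I
I = G(\varphi)\,V, \qquad \frac{\mathrm{d}\varphi}{\mathrm{d}t} = V
```

In both cases the device relates voltage and current through a resistance (or conductance) that depends only on the history of the driving signal, which is what makes the resistance state non-volatile.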
Instead of using the electron's charge, spin states can be utilized as a substitute in logic circuits or in traditional memory technologies like SRAM. An STT-RAM [23] cell stores data in a magnetic tunnel junction (MTJ). Each MTJ is composed of two ferromagnetic layers (a free and a reference layer) and one tunnel barrier layer (MgO). If the magnetization directions of the magnetically fixed reference layer and the switchable free layer are anti-parallel or parallel, a high or a low resistance, respectively, is set, representing a digital "0" or "1". Multiple states can also be stored: in [24] it was reported that, by adjusting intermediate magnetization angles in the free layer, 16 different states can be stored in one physical cell, making it possible to realize multi-level storage in MTJ technology.

The read latency and read energy of STT-RAM are expected to be comparable to those of SRAM. The expected 3x higher density and 7x lower leakage power consumption of STT-RAM make it suitable for replacing SRAM to build large NVMs. However, a write operation in an STT-RAM memory consumes 8x more energy and exhibits a 6x longer latency than in SRAM. Therefore, minimizing the impact of inefficient writes is critical for successful applications of STT-RAM [25].

NRAMs, a proprietary technology of Nantero, are a very promising NVM technology in terms of manufacturing maturity, according to their developers. NRAMs are based on nano-electromechanical carbon-nanotube switches (NEMS). In [27, 28], pinched hysteresis loops are shown in the current-voltage curves of such NEMS devices; consequently, NEMS and NRAMs are also memristors according to Leon Chua's theory. NRAM uses a fabric of carbon nanotubes (CNTs) for storing bits. The resistive state of the CNT fabric determines whether a one or a zero is stored in a memory cell. The resistance depends on the width of a bridge between two CNTs. With the help of a small voltage, the CNTs can be brought into contact or be separated. Reading out a bit means measuring the resistance. Nantero claims that their technology features the same read and write latencies as DRAM, has high endurance and reliability even in high-temperature environments, and is low power, with essentially zero power consumption in standby mode. Furthermore, NRAM is compatible with existing CMOS fabs without needing any new tools or processes, and it is scalable even to below 5 nm [29].

Fig. 3.4 gives an overview of some memristive devices' characteristics, established by Stefan Slesazeck from NaMLab for a comparison of memristive devices with different HfO2-based ferroelectric memory technologies (FeRAM, FeFET, and ferroelectric tunnelling junction devices, FTJs), which can also be used for the realization of non-volatile memories. The table is just a snapshot of an assessment; assessments by other authors differ widely in terms of better or worse values for different features. The International Technology Roadmap for Semiconductors (ITRS 2013) [30] reports an energy of operation of 6 pJ for PCMs and projects 1 fJ for the year 2025. Jeong and Shi [31] report in 2019 an energy of operation of 80 fJ to 0.03 nJ for prototype and research PCM devices and 0.1 pJ to 10 nJ for RAM-based devices, whereas the commercial OxRAM-based ReRAMs from Panasonic have a write speed of 100 ns and an energy value of 50 pJ per memory cell. A record-breaking energy efficiency is published by Vodenicarevic et al. [32] for STT-MRAMs, with 20 fJ/bit for a device area of 2 µm2, compared to 3 pJ/bit and 4000 µm2 for a state-of-the-art pure CMOS solution. The price for this value is a speed limited to a few dozen MHz; for embedded IoT devices, however, this can be sufficient. Despite these widely differing numbers, it is clear that these devices offer a lot of potential, and it is to be expected that some of this potential can be exploited for future computer architectures.

The NVSim simulator [33] is popular in computer-architecture research for assessing architectural structures based on circuit-level performance, energy and area models of emerging non-volatile memories. It allows the investigation of architectural structures for future NVM-based high-performance computers. Nevertheless, there is still a lot of work to do on the tool side: better models for memristor technology, both physical and analytical, have to be integrated into the tools, and the models themselves have to be fine-tuned.

Multi-Level Cell Capability of Memristors

One of the most promising benefits that memristive technologies like ReRAM, PCM, or STT-RAM offer is their capability of storing more than two states in one physical storage cell. MLC is necessary if memristors are used to emulate synaptic plasticity [34] (see Sect. 3.2.4). Compared to conventional SRAM or DRAM storage technology, this is an additional qualitative advantage on top of their non-volatility. In the literature this benefit is often denoted as multi-level-cell
Figure 3.4: Snapshot of different memristive devices' characteristics and conventional Si-based memory technologies, established by S. Slesazeck, reprinted from S. Yu, P.-Y. Chen, Emerging Memory Technologies, 2016 [26]
(MLC) or sometimes also as multi-bit capability. The different memristive technologies offer different benefits and drawbacks concerning the realization of the MLC feature. Details about these benefits and drawbacks, as well as the possibilities of using the MLC feature in future computing systems for caches, associative memories and ternary computing schemes, can be found in Sect. 3.2.2.

Current State

The above-mentioned memristor technologies are candidates which are already commercialized, or close to commercialization according to their manufacturers: PCM, ReRAM, MRAM, STT-RAM (an advanced MRAM technology which uses a spin-polarized current instead of a magnetic field to store information in the electron's spin, thereby allowing higher integration densities), and NRAM.

Intel and Micron already deliver the new 3D XPoint memory technology [35] as a Flash replacement, which is based on PCM technology. Their Optane SSD 905P series is available on the market and offers 960 GByte at an about four times higher price than current 1-TByte NAND-flash SSDs, but provides 2.5 to 77 times better performance than NAND SSDs. Intel and Micron expect that the XPoint technology could become, within the next ten years, the dominating alternative to RAM devices, offering the NVM property in addition. But the manufacturing process is complicated, and currently devices are expensive.

IBM published in 2016 progress achieved on a multi-level-cell (MLC) PCM technology [36] to replace Flash and to use it, e.g., as storage-class memory (SCM) of supercomputers, filling the latency gap between DRAM main memory and hard-disk-based background memory.

Adesto Technologies is offering CBRAM technology in their serial memory chips [37]. The company recently announced it will present new research showing the significant potential of Resistive RAM (RRAM) technology in high-reliability applications such as automotive. RRAM has great potential to become a widely used, low-cost and simple embedded non-volatile memory (NVM), as it utilizes simple cell structures and materials which can be integrated into existing manufacturing flows with as little as one additional mask. This makes Adesto's RRAM technology (trademarked as CBRAM) a promising candidate for high-reliability applications: CBRAM consumes less power, requires fewer processing steps, and operates at lower voltages compared to conventional embedded Flash technologies [38].

MRAM is an NVM technology that is already available today, however in a niche market. MRAM chips are produced by Everspin Technologies, GlobalFoundries and Samsung [22].

Everspin delivered in 2017 samples of STT-MRAMs in a perpendicular magnetic tunnel junction (pMTJ) process as 256-Mbit MRAMs and 1-GB SSDs. Samsung is developing an MRAM technology. IBM and Samsung reported already in 2016 an MRAM device capable of scaling down to 11 nm with a switching current of 7.5 microamps at 10 ns [22]. Samsung and TSMC have been producing MRAM products in 2018.

Everspin offers since August 2018 a 256-Mbit ST-DDR3 STT-MRAM storage device designed for enterprise-style applications like SSD buffers, RAID buffers or synchronous logging, where performance is critical and endurance is a must. The persistence of STT-MRAM protects data and enables systems to dramatically reduce latency, by up to 90%, boosting performance and driving both efficiency and cost savings [21]. Everspin is focusing with their MRAM products
on areas where there is a need for fast, persistent memory, by offering near-DRAM performance combined with non-volatility.

Right now the price of MRAM is still rather high, but it is the most interesting emerging memory technology because its performance is close to SRAM and DRAM, and its endurance is very high. MRAM makes sense for cache buffering and for specific applications, such as the nvNITRO NVMe storage accelerator for financial applications, where "doing a transaction quickly is important, but having a record is just as important" [39].

TSMC is also developing embedded MRAM and embedded ReRAM, as indicated by the TSMC roadmap in 2018 [40].

Nantero, together with Fujitsu, announced a multi-GB NRAM memory in carbon-nanotube technology expected for 2018. Having acquired the license to produce Nantero's NRAM (Nano-RAM), Fujitsu targeted 2019 for NRAM mass production. Nantero's CNT-based devices can be fabricated on standard CMOS production equipment, which may keep costs down. NRAM could be a Flash replacement, able to match the densities of current Flash memories and, theoretically, it could be made far denser than Flash.

Nantero also announced a multi-gigabyte DDR4-compatible MRAM memory with speed comparable to DRAM at a lower price per gigabyte. Caches based on non-volatile technology will remove the need for battery backup. Nantero said that this allows for a dramatic expansion of cache size, substantially speeding up the SSD or HDD. Embedded memory will eventually be able to scale to 5 nm in size (the most advanced semiconductors are being produced at the 10-nm and 7-nm nodes), operate at DRAM-like speeds, and operate at very high temperature, said Nantero. The company said that the embedded memory devices will be well suited for several IoT applications, including automotive [41].

Perspective

It is foreseeable that memristor technologies will supersede current Flash memory. Memristors offer orders of magnitude faster read/write accesses and also much higher endurance. They are resistive switching memory technologies, and thus rely on different physics than storing charge on a capacitor, as is the case for SRAM, DRAM and Flash. Some memristor technologies have even been considered as a feasible replacement for SRAM [42, 43, 44]. Studies suggest that replacing SRAM with STT-RAM could save 60% of LLC energy with less than 2% performance degradation [42].

Besides their potential as memories, memristors which are complementary switches offer a highly promising approach to realize memory and logic functionality in a single device, e.g. for reconfigurable logic [16], and memristors with multi-level cell capability enable the emulation of synaptic plasticity [34] to realize neuromorphic computing, e.g. for machine learning with memristor-based neural networks.

One of the challenges for the next decade is the provision of appropriate interfacing circuits between the SCMs, or NVM technologies in general, and the microprocessor cores. A related challenge in this context is developing efficient interface circuits in such a way that this additional overhead does not corrupt the benefits of memristor devices in integration density, energy consumption and access times compared to conventional technologies.

STT-RAM devices primarily target the replacement of DRAM, e.g., in last-level caches (LLC). However, the asymmetric read/write energy and latency of NVM technologies introduce new challenges in designing memory hierarchies. Spintronics allows integration of logic and storage at lower power consumption. Also, new hybrid PCM/Flash SSD chips could emerge, with a processor-internal last-level cache (STT-RAM), main processor memory (ReRAM, PCRAM), and storage-class memory (PCM or other NVM).

All commercially available memristive memories feature better characteristics than Flash, but are much more expensive. It is unclear when most of the new technologies will be mature enough and which of them will prevail at a competitive price. "It's a veritable zoo of technologies and we'll have to wait and see which animals survive the evolutionary process," said Thomas Coughlin, founder of Coughlin Associates.

One of the most promising benefits that memristive technologies like ReRAM, PCM, or STT-RAM offer is their capability of storing more than two states in one physical storage cell. Compared to conventional SRAM or DRAM storage technology, this is an additional qualitative advantage on top of their non-volatility.
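The multi-level-cell read-out principle behind this advantage, namely quantizing a measured cell resistance into one of several symbols instead of just two, can be sketched in a few lines. The resistance thresholds below are purely illustrative, not taken from any of the devices discussed:

```python
# Illustrative sketch of multi-level-cell (MLC) read-out: a measured
# cell resistance is quantized into one of four levels (2 bits/cell).
# The threshold values are hypothetical, not from a real device.

import bisect

# Boundaries (in ohms) between the four resistance levels, ascending.
THRESHOLDS = [2e3, 10e3, 50e3]  # hypothetical values

def read_mlc(resistance_ohm: float) -> int:
    """Map a measured resistance to a 2-bit symbol (0..3).

    Lower resistance maps to a lower symbol, analogous to the cells
    described above, where the low-resistance state encodes one value
    and the high-resistance state another.
    """
    return bisect.bisect(THRESHOLDS, resistance_ohm)

def read_word(resistances) -> int:
    """Read a sequence of MLC cells into an integer, 2 bits per cell."""
    word = 0
    for r in resistances:
        word = (word << 2) | read_mlc(r)
    return word

# Four cells together yield one 8-bit value:
print(read_mlc(1e3))                       # lowest level  -> 0
print(read_mlc(100e3))                     # highest level -> 3
print(read_word([1e3, 5e3, 20e3, 100e3]))  # 0b00011011 -> 27
```

The same structure extends to more levels per cell (e.g. the 92 distinct resistance levels mentioned in Sect. 3.2.2) simply by adding thresholds, which is exactly where read margins and resistance drift become the limiting factors.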
3.2.2 Multi-level-cell (MLC)

The different memristive technologies offer different benefits and drawbacks concerning the realization of the MLC feature. For example, one of the main challenges in MLC-PCM systems is read-reliability degradation due to resistance drift [45]. Resistance drift means that the different phase states in the chalcogenide storage material can overlap, since each read step slightly changes the phase; this is not a real problem in single-level cells (SLC), but it is in MLCs. In a recently published work, the impressive number of 92 distinct resistance levels was demonstrated for a so-called bi-layer ReRAM structure [46]. In such a bi-layer structure, not just one metal-oxide layer (usually HfO2 or TiO2 technology) is used as storage material, enclosed between a metallic top and a bottom electrode; rather, a sequence of metal-oxide layers separated by an isolating layer is used, leading to a better separation of the different resistance levels at the price of a much more difficult manufacturing process. A memristive MLC technique based on MRAM technology without spin-polarized electrons was proposed to store up to 8 different levels [47]. In STT-MRAM technology, using spin-polarized electrons, 2-bit cells are most common and have also been physically demonstrated at layout level [48].

MLC as Memory

In its general SLC form, STT-MRAM is heavily discussed as a candidate memory technology for near-term realization of future last-level caches, due to its high density and comparatively fast read/write access latencies. On the academic side, the next step under discussion is how to profit from the MLC capability [49].

The last example, concerning MLC caches, is representative for all memristive NVM technologies and their MLC capability. It shows that the MLC feature is of interest for improving the performance of future computer and processor architectures. In this context, MLC NVMs are closely related to future near-memory and in-memory computing concepts, both for future embedded HPC systems and for embedded smart devices for IoT and CPS. For near-memory-computing architectures, e.g. as embedded memories, they can be used for a better high-performance multi-bit cache in which different tasks store their cached values in the same cache line.

Another recent state-of-the-art application is their use in microcontroller units for energy-efficient, non-volatile check-pointing or normally-off/instant-on operation with near-zero-latency boot, as recently announced by the French start-up company eVaderis SA [50].

To this context also belongs research work on ternary content-addressable memories (TCAM) with memristive devices, in which the third state is used for the realization of the don't-care state in TCAMs. Many papers, e.g. [51], show that memristive TCAMs need less energy and less area than equivalent CMOS TCAMs. However, most of the proposed memristive TCAM approaches do not exploit the MLC capability: they use three memristors to store 1, 0, and X (don't care). In a next step, this can be expanded to exploit the MLC capability of such devices for further energy and area improvements.

Ternary Arithmetic Based on Signed-Digit (SD) Number Systems

Another promising aspect of the MLC capability of memristive devices is their use in ternary arithmetic circuits or processors based on signed-digit (SD) number systems. In an SD number system, a digit can also take a negative value; for the ternary case, we are given not a bit but a trit with the values -1, 0, and +1. It has long been known that ternary, or generally redundant, number systems, in which more than two states per digit are mandatory, reduce the effort of an addition to a complexity of O(1), compared to the O(log N) achievable in the best case with pure binary adders. In the past, conventional computer architectures did not exploit this advantage of signed-digit addition. One exception was the compute unit of the ILLIAC III [52] computer, built in the 1960s, at a time when the technology was not as mature as today and it was necessary to achieve high compute speeds with a superior arithmetic concept, even at the price of doubling the memory requirements to store a ternary value in two physical memory cells. In the course of further technological development, with pipeline processing offering latency hiding, ALUs became faster and faster, and it was no longer acceptable to store operands in a redundant representation that is larger than a binary
one. This would double the number of registers, double the size of the data cache, and double the necessary size of data segments in main memory. However, with the advent of CMOS-compatible NVM technology offering MLC capability, the situation has changed. This calls for a re-evaluation of these redundant computer-arithmetic schemes under a detailed consideration of MLC NVM technology.

Perspectives and Research Challenges

Different works have already investigated the principal possibilities of ternary coding schemes using MLC memristive memories. This was carried out both for hybrid solutions, i.e. memristors used as ternary memory cells for digital CMOS-based logic circuits [53], [54], and in proposals for in-memory-computing-like architectures, in which the memristive memory cell is used simultaneously as storage and as a logical processing element, as part of a resistor network with dynamically changing resistances [55]. The goal of this work on MLC NVM technology for ternary processing is not only to save latency but also to save energy, since the number of elementary compute steps is reduced compared to conventional arithmetic implemented in state-of-the-art processors. This reduced number of processing steps should also lead to reduced energy needs.

So-far-unpublished work, carried out in the group of the author of this chapter, shows that in CMOS combinatorial processing, i.e. without storing the results, the energy consumption could be reduced by about 30% using a ternary adder, compared to the best parallel-prefix binary adders in a 45-nm CMOS process. This advantage is lost if the results are stored in binary registers. To keep this advantage and exploit it in IoT and embedded devices, which are particularly energy-sensitive, ternary storage and compute schemes based on MLC NVMs have to be integrated into future near- and in-memory computing schemes.

To achieve this goal, research work is necessary on the following topics: (i) design tools considering automatic integration and evaluation of NVMs in CMOS, which (ii) requires the development of appropriate physical models not only at the analogue level but also at the logic and RTL levels; (iii) appropriate interface circuitry for addressing NVMs; and (iv) in general, the next step that has to be made: going from existing concepts and demonstrated single devices to real systems.

References

[1] Wikipedia. Memristor. url: http://en.wikipedia.org/wiki/Memristor.

[2] A. L. Shimpi. Samsung's V-NAND: Hitting the Reset Button on NAND Scaling. AnandTech. 2013. url: http://www.anandtech.com/show/7237/samsungs-vnand-hitting-the-reset-button-on-nand-scaling.

[3] L. Chua. "Memristor - The missing circuit element". In: IEEE Transactions on Circuit Theory 18.5 (1971).

[4] D. Strukov, G. Snider, D. Stewart, and R. Williams. "The missing memristor found". In: Nature 453 (2008).

[5] P. W. C. Ho, N. H. El-Hassan, T. N. Kumar, and H. A. F. Almurib. "PCM and Memristor based nanocrossbars". In: 2015 IEEE 15th International Conference on Nanotechnology (IEEE-NANO). IEEE. 2015. url: https://ieeexplore.ieee.org/document/7388636/.

[6] B. C. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, and D. Burger. "Phase-change Technology and the Future of Main Memory". In: IEEE Micro 30.1 (2010).

[7] C. Lam. "Cell Design Considerations for Phase Change Memory as a Universal Memory". In: VLSI Technology, Systems and Applications. IEEE. 2008, pp. 132–133.

[8] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger. "Architecting Phase Change Memory As a Scalable Dram Alternative". In: Proceedings of the 36th Annual International Symposium on Computer Architecture. ISCA '09. 2009, pp. 2–13.

[9] M. K. Qureshi, V. Srinivasan, and J. A. Rivers. "Scalable High Performance Main Memory System Using Phase-change Memory Technology". In: Proceedings of the 36th Annual International Symposium on Computer Architecture. ISCA '09. 2009, pp. 24–33.

[10] P. Zhou, B. Zhao, J. Yang, and Y. Zhang. "A Durable and Energy Efficient Main Memory Using Phase Change Memory Technology". In: Proceedings of the 36th Annual International Symposium on Computer Architecture. ISCA '09. 2009, pp. 14–23.

[11] W. Wong. Conductive Bridging RAM. Electronic Design. 2014. url: http://electronicdesign.com/memory/conductive-bridging-ram.

[12] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie. "Overcoming the Challenges of Crossbar Resistive Memory Architectures". In: 21st International Symposium on High Performance Computer Architecture (HPCA). IEEE. 2015, pp. 476–488.

[13] C. Xu, X. Dong, N. P. Jouppi, and Y. Xie. "Design Implications of Memristor-based RRAM Cross-point Structures". In: Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE. 2011, pp. 1–6.

[14] Y. Shuai et al. "Substrate effect on the resistive switching in BiFeO3 thin films". In: Journal of Applied Physics 111.7 (Apr. 2012), p. 07D906. doi: 10.1063/1.3672840.

[15] T. You et al. "Bipolar Electric-Field Enhanced Trapping and Detrapping of Mobile Donors in BiFeO3 Memristors". In: ACS Applied Materials & Interfaces 6.22 (2014), pp. 19758–19765. doi: 10.1021/am504871g.
3.2.3 Memristive Computing

In this section, memristive (also called resistive) computing is discussed, in which logic circuits are built from memristors [1].

Overview of Memristive Computing

Memristive computing is one of the emerging and promising computing paradigms [1, 2, 3]. It takes the data-centric computing concept much further by interweaving the processing units and the memory in the same physical location using non-volatile technology, thereby significantly reducing not only the power consumption but also the memory bottleneck. Resistive devices such as memristors have been shown to be able to perform both storage and logic functions [1, 4, 5, 6].

Memristive gates have a lower leakage power, but switching is slower than in CMOS gates [7]. However, the integration of memory into logic allows the logic to be reprogrammed, providing low-power reconfigurable components [8], and can in principle reduce energy and area constraints due to the possibility of computing and storing in the same device (computing in memory). Memristors can also be arranged in parallel networks to enable massively parallel computing [9].

Memristive computing offers a huge potential compared with the current state of the art:

• It significantly reduces the memory bottleneck, as it interweaves the storage, computing units, and the communication [1, 2, 3].

• It features low leakage power [7].

• It enables maximum parallelism [3, 9] by in-memory computing.

• It allows full configurability and flexibility [8].

• It provides order-of-magnitude improvements in the energy-delay product per operation, computational efficiency, and performance per area [3].

Serial and parallel connections of memristors were proposed for the realization of Boolean logic gates by so-called memristor ratioed logic. In such circuits, the ratio of the resistances stored in the memristor devices is exploited for the set-up of Boolean logic. Memristive circuits realizing AND gates, OR gates, and the implication function were presented in [10, 11, 12].

Hybrid memristive computing circuits consist of memristors and CMOS gates. The research of Singh [13], Xia et al. [14], Rothenbuhler et al. [12], and Guckert and Swartzlander [15] is representative of numerous proposals for hybrid memristive circuits, in which most of the Boolean logic operators are handled in the memristors and the CMOS transistors are mainly used for level restoration to retain defined digital signals.

Figure 3.5 summarizes the activities on memristive computing. The largest block is hardware support with memristive elements for neural networks, neuromorphic processing, and STDP (spike-timing-dependent plasticity) (see Sect. 3.2.4). Judging by the published papers, digital memristive computing is a much smaller branch, with several sub-branches such as ratioed logic, implication logic, or CMOS-like equivalent memristor circuits in which Boolean logic is directly mapped onto crossbar topologies with memristors. These solutions refer to pure in-memory computing concepts. Besides that, proposals for hybrid solutions exist in which the memristors are used as memory for CMOS circuits in new arithmetic circuits exploiting the MLC capability of memristive devices.

Current State of Memristive Computing

A couple of start-up companies appeared on the market in 2015 that offer memristor technology as a BEOL (back-end of line) service, in which memristive elements are post-processed in CMOS chips directly on top of the last metal layers. Some European institutes also recently reported, at the workshop meeting “Memristors: at the crossroad of Devices and Applications” of the EU COST Action IC1401 MemoCiS (www.cost.eu/COST_Actions/ict/IC1401), the possibility of BEOL integration of their memristive technology to allow experiments with such technologies [16]. This offers new perspectives in the form of hybrid CMOS/memristor logic, which uses memristor networks for high-density resistive logic circuits and CMOS inverters for signal restoration to compensate for the loss of full voltage levels in memristive networks.

The multi-level cell capability of memristive elements can be used to face the challenge of handling the zettabytes of data expected to be produced annually within a
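The ratioed-logic principle, reading a resistive divider against the threshold of a restoring CMOS stage, can be sketched numerically. This is a minimal illustration only: the resistance states, grounding resistor, supply voltage, and threshold below are assumed values, not taken from the cited designs.

```python
# Sketch of memristive ratioed logic: stored memristor resistances
# (low-resistive = logic 1, high-resistive = logic 0) set the output of a
# resistive voltage divider, which a CMOS buffer thresholds to restore
# full logic levels. All component values are illustrative assumptions.

R_ON, R_OFF = 1e3, 1e6   # low/high resistive states in ohms (assumed)
VDD = 1.0                # supply voltage (assumed)
V_TH = 0.5               # threshold of the restoring CMOS buffer (assumed)
R_G = 31e3               # grounding resistor, chosen near sqrt(R_ON * R_OFF)

def resistance(bit: int) -> float:
    """Map a stored logic value to a memristor resistance state."""
    return R_ON if bit else R_OFF

def ratioed_or(a: int, b: int) -> int:
    """Two memristors in parallel pull the output node toward VDD;
    one low-resistive device suffices to lift it above the threshold."""
    r_par = 1.0 / (1.0 / resistance(a) + 1.0 / resistance(b))
    v_out = VDD * R_G / (r_par + R_G)
    return 1 if v_out >= V_TH else 0

def ratioed_and(a: int, b: int) -> int:
    """Two memristors in series: the divider output is high only if
    both devices are in the low-resistive state."""
    r_ser = resistance(a) + resistance(b)
    v_out = VDD * R_G / (r_ser + R_G)
    return 1 if v_out >= V_TH else 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, ratioed_and(a, b), ratioed_or(a, b))
```

The sketch also shows why the CMOS restoration stage mentioned in the hybrid proposals is needed: the divider output is a degraded analogue level (e.g. roughly 0.94 VDD for AND with both inputs high), not a full logic swing.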
couple of years. Besides, proposals exist to exploit the multi-level cell storing property for ternary carry-free arithmetic [17, 18], or for both compact storing of keys and matching operations in future associative memories realized with memristors [19], so-called ternary content-addressable memories.

Impact on Hardware

Using NVM technologies for resistive computing is a further step towards energy-aware measures for future HPC architectures. In addition, technology facilities exist at the IHP in Frankfurt (Oder) which, at least for small feature sizes, allow memristors and CMOS logic to be integrated on a single chip without a separate BEOL process step. This supports the realization of both near-memory and in-memory computing concepts, which are both an important brick for the realization of more energy-saving HPC systems. Near-memory computing could be based on 3D stacking of a logic layer with DRAMs, e.g. extending Intel’s High Bandwidth Memory (HBM) with NVM devices and stacked logic circuitry in the future. In-memory computing could be based on memristive devices using either ReRAM, PCM, or STT-RAM technology for simple logic and arithmetic pre-processing operations.

A further way to save energy, e.g. in near-memory computing schemes, is to use non-volatile register cells as flip-flops or in memory cell arrays. During the last decade, the basic principle in the design of non-volatile flip-flops (nvFF) has been to compose them from a standard CMOS flip-flop (FF) and a non-volatile memory cell, either as part of a flip-flop/memristor register pair or as a pair of a complete SRAM cell array and a subsequently attached memristor cell array (hybrid NVMs). At predefined time steps or on power loss, this non-volatile memory cell backs up the contents of the standard FF. At power recovery, this content is restored in the FF and the Non-Volatile Processor (NVP) can continue in the exact same state. nvFFs following this approach require a centralized controller to initiate a backup or a restore operation. This centralized controller has to issue the backup signal as fast as possible after a no-power standby; otherwise data and processing progress may be lost.

Four different implementation categories of nvFFs using hybrid retention architectures are available today:

• Ferroelectric nvFF: This category uses a ferroelectric capacitor to store one bit. Masui et al. [21] introduced this kind of nvFF, but different approaches are also available.

• Magnetic RAM (MRAM) nvFF: This approach uses the spin direction of Magnetic Tunnel Junctions to store a bit [22].

• CAAC-OS nvFF: CAAC-OS transistors have an extremely low off-state current. By combining
them with small capacitors, an nvFF can be created [23]. The access times of these nvFFs are very low.

• Resistive RAM (ReRAM) nvFF: ReRAMs are a special implementation of NVM using memristor technology. They do not consume any power in their off-state. nvFF implementations using ReRAM are currently being evaluated [24, 25].

Memristor-based non-volatile synchronous flip-flop circuits have also been proposed, built from pass transistors and a high-valued resistor [26], or using a sense amplifier reading the differential state of two memristors which are controlled by two transmission gates [27]. The latter approach seems to be beneficial in terms of performance, power consumption, and robustness, and shows large potential for use in no-power standby devices which can be activated instantaneously upon an input event.
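The backup/restore protocol shared by these hybrid retention schemes can be sketched in a few lines. This is a behavioural sketch only; the class and method names are illustrative and not drawn from any of the cited designs.

```python
# Behavioural sketch of the nvFF backup/restore protocol: a centralized
# controller copies the volatile flip-flop states into their paired
# non-volatile cells when a power-loss event is signalled, and restores
# them on power recovery so the Non-Volatile Processor (NVP) resumes in
# the exact same state. All names are illustrative assumptions.

class NonVolatileProcessor:
    def __init__(self, n_flipflops: int):
        self.ff = [0] * n_flipflops   # volatile CMOS flip-flop states
        self.nv = [0] * n_flipflops   # paired non-volatile memory cells

    def on_power_loss(self) -> None:
        """Centralized controller: issue the backup as fast as possible
        after a no-power standby, or data and progress are lost."""
        self.nv = list(self.ff)       # copy FF contents to NV cells

    def on_power_recovery(self) -> None:
        """Restore FF contents so execution continues where it stopped."""
        self.ff = list(self.nv)

nvp = NonVolatileProcessor(4)
nvp.ff = [1, 0, 1, 1]                 # some architectural state
nvp.on_power_loss()                   # backup before the supply collapses
nvp.ff = [0, 0, 0, 0]                 # volatile state decays without power
nvp.on_power_recovery()
print(nvp.ff)                         # prints [1, 0, 1, 1]
```

The sketch makes the timing constraint visible: `on_power_loss` must complete before the supply drops below the retention level, which is why the controller's backup latency is the critical parameter of these designs.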
References

[1] J. Borghetti, G. S. Snider, P. J. Kuekes, J. J. Yang, D. R. Stewart, and R. S. Williams. “Memristive switches enable stateful logic operations via material implication”. In: Nature 464.7290 (Apr. 2010), pp. 873–876. doi: 10.1038/nature08940. url: https://www.nature.com/nature/journal/v464/n7290/full/nature08940.html.

[2] M. Di Ventra and Y. V. Pershin. “Memcomputing: a computing paradigm to store and process information on the same physical platform”. In: Nature Physics 9.4 (Apr. 2013), pp. 200–202. doi: 10.1038/nphys2566. url: http://arxiv.org/abs/1211.4487.

[3] S. Hamdioui et al. “Memristor based computation-in-memory architecture for data-intensive applications”. In: 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE). May 2015, pp. 1718–1725.

[4] G. Snider. “Computing with hysteretic resistor crossbars”. In: Applied Physics A 80.6 (Mar. 2005), pp. 1165–1172. doi: 10.1007/s00339-004-3149-1. url: https://link.springer.com/article/10.1007/s00339-004-3149-1.

[5] L. Gao, F. Alibart, and D. B. Strukov. “Programmable CMOS/Memristor Threshold Logic”. In: IEEE Transactions on Nanotechnology 12.2 (Mar. 2013), pp. 115–119. doi: 10.1109/TNANO.2013.2241075.

[6] L. Xie, H. A. D. Nguyen, M. Taouil, S. Hamdioui, and K. Bertels. “Fast boolean logic mapped on memristor crossbar”. In: 2015 33rd IEEE International Conference on Computer Design (ICCD). Oct. 2015, pp. 335–342. doi: 10.1109/ICCD.2015.7357122.

[7] Y. V. Pershin and M. Di Ventra. “Neuromorphic, Digital, and Quantum Computation With Memory Circuit Elements”. In: Proceedings of the IEEE 100.6 (June 2012), pp. 2071–2080. doi: 10.1109/JPROC.2011.2166369.

[8] J. Borghetti, Z. Li, J. Straznicky, X. Li, D. A. A. Ohlberg, W. Wu, D. R. Stewart, and R. S. Williams. “A hybrid nanomemristor/transistor logic circuit capable of self-programming”. In: Proceedings of the National Academy of Sciences 106.6 (Feb. 2009), pp. 1699–1703. doi: 10.1073/pnas.0806642106. url: http://www.pnas.org/content/106/6/1699.

[9] Y. V. Pershin and M. Di Ventra. “Solving mazes with memristors: a massively-parallel approach”. In: Physical Review E 84.4 (Oct. 2011). doi: 10.1103/PhysRevE.84.046703. url: http://arxiv.org/abs/1103.0021.

[10] J. J. Yang, D. B. Strukov, and D. R. Stewart. “Memristive devices for computing”. In: Nature Nanotechnology 8.1 (Jan. 2013), pp. 13–24. doi: 10.1038/nnano.2012.240. url: http://www.nature.com/nnano/journal/v8/n1/full/nnano.2012.240.html.

[11] S. Kvatinsky, A. Kolodny, U. C. Weiser, and E. G. Friedman. “Memristor-based IMPLY Logic Design Procedure”. In: Proceedings of the 2011 IEEE 29th International Conference on Computer Design (ICCD ’11). Washington, DC, USA, 2011, pp. 142–147. doi: 10.1109/ICCD.2011.6081389. url: http://dx.doi.org/10.1109/ICCD.2011.6081389.

[12] T. Tran, A. Rothenbuhler, E. H. B. Smith, V. Saxena, and K. A. Campbell. “Reconfigurable Threshold Logic Gates using memristive devices”. In: 2012 IEEE Subthreshold Microelectronics Conference (SubVT). Oct. 2012, pp. 1–3. doi: 10.1109/SubVT.2012.6404301.

[13] T. Singh. “Hybrid Memristor-CMOS (MeMOS) based Logic Gates and Adder Circuits”. In: arXiv:1506.06735 [cs] (June 2015). url: http://arxiv.org/abs/1506.06735.

[14] Q. Xia et al. “Memristor–CMOS Hybrid Integrated Circuits for Reconfigurable Logic”. In: Nano Letters 9.10 (Oct. 2009), pp. 3640–3645. doi: 10.1021/nl901874j. url: http://dx.doi.org/10.1021/nl901874j.

[15] L. Guckert and E. E. Swartzlander. “Dadda Multiplier designs using memristors”. In: 2017 IEEE International Conference on IC Design and Technology (ICICDT). 2017.

[16] J. Sandrini, M. Thammasack, T. Demirci, P.-E. Gaillardon, D. Sacchetto, G. De Micheli, and Y. Leblebici. “Heterogeneous integration of ReRAM crossbars in 180 nm CMOS BEoL process”. In: Microelectronic Engineering 145 (Sept. 2015).

[17] A. A. El-Slehdar, A. H. Fouad, and A. G. Radwan. “Memristor based N-bits redundant binary adder”. In: Microelectronics Journal 46.3 (Mar. 2015), pp. 207–213. doi: 10.1016/j.mejo.2014.12.005. url: http://www.sciencedirect.com/science/article/pii/S0026269214003541.

[18] D. Fey. “Using the multi-bit feature of memristors for register files in signed-digit arithmetic units”. In: Semiconductor Science and Technology 29.10 (2014), p. 104008. doi: 10.1088/0268-1242/29/10/104008. url: http://stacks.iop.org/0268-1242/29/i=10/a=104008.

[19] P. Junsangsri, F. Lombardi, and J. Han. “A memristor-based TCAM (Ternary Content Addressable Memory) cell”. In: 2014 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH). July 2014, pp. 1–6. doi: 10.1109/NANOARCH.2014.6880478.

[20] F. Su, Z. Wang, J. Li, M. F. Chang, and Y. Liu. “Design of nonvolatile processors and applications”. In: 2016 IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC). Sept. 2016, pp. 1–6. doi: 10.1109/VLSI-SoC.2016.7753543.

[21] S. Masui, W. Yokozeki, M. Oura, T. Ninomiya, K. Mukaida, Y. Takayama, and T. Teramoto. “Design and applications of ferroelectric nonvolatile SRAM and flip-flop with unlimited read/program cycles and stable recall”. In: Proceedings of the IEEE 2003 Custom Integrated Circuits Conference. Sept. 2003, pp. 403–406. doi: 10.1109/CICC.2003.1249428.

[22] W. Zhao, E. Belhaire, and C. Chappert. “Spin-MTJ based Non-volatile Flip-Flop”. In: 2007 7th IEEE Conference on Nanotechnology (IEEE NANO). Aug. 2007, pp. 399–402. doi: 10.1109/NANO.2007.4601218.

[23] T. Aoki et al. “30.9 Normally-off computing with crystalline InGaZnO-based FPGA”. In: 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). Feb. 2014, pp. 502–503. doi: 10.1109/ISSCC.2014.6757531.

[24] I. Kazi, P. Meinerzhagen, P. E. Gaillardon, D. Sacchetto, A. Burg, and G. De Micheli. “A ReRAM-based non-volatile flip-flop with sub-VT read and CMOS voltage-compatible write”. In: 2013 IEEE 11th International New Circuits and Systems Conference (NEWCAS). June 2013, pp. 1–4. doi: 10.1109/NEWCAS.2013.6573586.
[25] A. Lee et al. “A ReRAM-Based Nonvolatile Flip-Flop With
Self-Write-Termination Scheme for Frequent-OFF Fast-
Wake-Up Nonvolatile Processors”. In: IEEE Journal of Solid-
State Circuits 52.8 (Aug. 2017), pp. 2194–2207. doi: 10.1109/
JSSC.2017.2700788.
[26] J. Zheng, Z. Zeng, and Y. Zhu. “Memristor-based nonvolatile
synchronous flip-flop circuits”. In: 2017 Seventh International
Conference on Information Science and Technology (ICIST). Apr.
2017, pp. 504–508. doi: 10.1109/ICIST.2017.7926812.
[27] S. Pal, V. Gupta, and A. Islam. “Variation resilient low-power
memristor-based synchronous flip-flops: design and analy-
sis”. In: Microsystem Technologies (July 2018). doi: 10.1007/
s00542-018-4044-6. url: https://doi.org/10.1007/
s00542-018-4044-6.
3.2.4 Neuromorphic and Neuro-Inspired Computing

Neuromorphic and neuro-inspired approaches mimic the functioning of the human brain (or our understanding of its functioning) to efficiently perform computations that are difficult or impractical for conventional computer architectures [1, 2].

Neuromorphic Computing (NMC), as developed by Carver Mead in the late 1980s, describes the use of large-scale adaptive analog systems to mimic organizational principles used by the nervous system. Originally, the main approach was to use elementary physical phenomena of integrated electronic devices (transistors, capacitors, . . . ) as computational primitives [1]. More recently, the term neuromorphic has also been used to describe analog, digital, and mixed-mode analog/digital hardware and software systems that transfer aspects of structure and function from biological substrates to electronic circuits (for perception, motor control, or multisensory integration). Today, the majority of NMC implementations are based on CMOS technology. Interesting alternatives are, for example, oxide-based memristors, spintronics, or nanotubes [3, 4, 5]. Such research is still at an early stage.

The basic idea of NMC is to exploit the massive parallelism of such circuits and to create low-power and fault-tolerant information-processing systems. Aiming at overcoming the big challenges of deep-submicron CMOS technology (power wall, reliability, and design complexity), bio-inspiration offers alternative routes to (embedded) artificial intelligence. The challenge is to understand, design, build, and use new architectures for nanoelectronic systems which unify the best of brain-inspired information-processing concepts and of nanotechnology hardware, including both algorithms and architectures [6]. A key focus area in further scaling and improving cognitive systems is decreasing the power density and power consumption, and overcoming the CPU/memory bottleneck of conventional computational architectures [7].

Current State of CMOS-Based Neuromorphic Approaches

Large-scale neuromorphic chips based on CMOS technology exist, replacing or complementing conventional computer architectures with brain-inspired architectures. Mapping brain-like structures and processes into electronic substrates has recently seen a revival with the availability of deep-submicron CMOS technology.

Advances in technology have successively increased our ability to emulate artificial neural networks (ANNs) with speed and accuracy. At the same time, our understanding of neurons in the brain has increased substantially, with imaging and microprobes contributing significantly to our understanding of neural physiology. These advances in both technology and neuroscience have stimulated international research projects with the ultimate goal of emulating entire (human) brains. Large programs on brain research through advanced neurotechnologies have been launched worldwide, e.g. the U.S. BRAIN initiative (launched in 2013 [8]), the EC flagship Human Brain Project (launched in 2013 [9]), the China Brain Project (launched in 2016 [10]), and the Japanese government-initiated Brain/MINDS project (launched in 2016 [11]). Besides basic brain research, these programs aim at developing electronic neuromorphic machine technology that scales to biological levels. More simply stated, they attempt to build a new kind of computer with similar form and function to the mammalian brain. Such artificial brains would be used to build robots whose intelligence matches that of mice and cats. The ultimate aim is to build technical systems that match a mammalian brain in function, size, and power consumption: recreating 10 billion neurons and 100 trillion synapses, consuming one kilowatt (the same as a small electric heater), and occupying less than two litres of space [8].

The majority of larger, more bio-realistic simulations of brain areas are still done on high-performance supercomputers (HPS). For example, the Blue Brain Project [12] at EPFL in Switzerland has deployed HPSs from the beginning for digital reconstruction and simulations of the mammalian brain. The goal of the Blue Brain Project (EPFL and IBM, launched in 2005) “. . . is to build biologically detailed digital reconstructions and simulations of the rodent, and ultimately the human brain. The supercomputer-based reconstructions and simulations built by the project offer a radically new approach for understanding the multilevel structure and function of the brain.” The project uses an IBM Blue Gene supercomputer (100 TFLOPS, 10 TB) with currently 8,000 CPUs to simulate ANNs (at ion-channel level) in software [12]. The time needed to simulate brain areas is at least two orders of magnitude larger than biological time scales. Based on a simpler (point) neuron model, the simulation could have delivered orders of magnitude higher performance. Dedicated brain-simulation machines (neurocomputers) based on application-specific architectures offer faster emulations of such simpler neuron models.

Closely related to the Blue Brain Project is the Human Brain Project (HBP), a European Commission Future and Emerging Technologies Flagship [9]. The HBP aims to put in place a cutting-edge, ICT-based scientific research infrastructure that will allow scientific and industrial researchers to advance our knowledge in the fields of neuroscience, computing, and brain-related medicine. The project promotes collaboration across the globe, and is committed to driving forward European industry. Within the HBP, the subproject SP9 designs, implements, and operates a Neuromorphic Computing Platform with configurable Neuromorphic Computing Systems (NCS). The platform provides NCS based on physical (analogue or mixed-signal) emulations of brain models running in accelerated mode (NM-PM1, a wafer-scale implementation of 384 chips with about 200,000 analog neurons per wafer in 180 nm CMOS, with 20 wafers in the full system), numerical models running in real time on a digital multicore architecture (NM-MC1, with 18 ARM cores per chip in 130 nm CMOS, 48 chips per board, and 1,200 boards in the full system), and the software tools necessary to design, configure, and measure the performance of these systems. The platform will be tightly integrated with the High Performance Analytics and Computing Platform, which will provide essential services for mapping and routing circuits to neuromorphic substrates, benchmarking, and simulation-based verification of hardware specifications [9]. For both neuromorphic hardware systems, new chip versions are under development within the HBP. NM-PM2 is a wafer-scale integration based on a new mixed-signal chip in 65 nm CMOS, integrating a custom SIMD processor (32-bit, 128-bit wide vectors) for learning (6-bit SRAM synapse circuits), an analog network core with better precision per neuron (10-bit resolution), and an improved communication system [13]. NM-MC2 provides 144 ARM M4F cores per chip in 22 nm CMOS technology with floating-point support, 128 KByte local SRAM, and improved power management; furthermore, the chip provides a dedicated pseudo-random number generator, an exponential-function accelerator, and a Multiply-Accumulate (MAC) array (16x4 8-bit multipliers) with DMA for rate-based ANN computation [14].

The number of neuromorphic systems is constantly increasing, but not as fast as hardware accelerators for non-spiking ANNs. Most of them are research prototypes (e.g. the trainable neuromorphic processor for fast pattern classification from the Seoul National University (Korea) [15], or the Tianjic chip from Beijing’s Tsinghua University Center for Brain Inspired Computing Research [16]). Examples from industry are the TrueNorth chip from IBM [17] and the Loihi chip from INTEL [18]. The IBM TrueNorth chip integrates a two-dimensional on-chip network of 4,096 digital application-specific cores (64 x 64) and over 400 million bits of local on-chip memory to store individually programmable synapses. One million individually programmable neurons can be simulated, time-multiplexed, per chip. The chip, with about 5.4 billion transistors, is fabricated in a 28 nm CMOS process (4.3 cm² die size, 240 µm x 390 µm per core) and is by device count the largest IBM chip ever fabricated. The INTEL self-learning neuromorphic Loihi chip integrates 2.07 billion transistors in a 60 mm² die fabricated in Intel’s 14 nm CMOS FinFET process. The first iteration of Loihi houses 128 clusters of 1,024 artificial neurons each, for a total of 131,072 simulated neurons, up to 128 million (1-bit) synapses (16 MB), three Lakefield (Intel Quark) CPU cores, and an off-chip communication network. An asynchronous NoC manages the communication of packetized messages between clusters. Loihi is not a product, but is available for research purposes among academic research groups organized in the INTEL Neuromorphic Research Community (INRC).

Artificial Neural Networks (ANNs)

All the above-mentioned projects have in common that they model spiking neurons, the basic information-processing element in biological nervous systems. A more abstract implementation of biological neural systems are Artificial Neural Networks (ANNs). Popular representatives are Deep Neural Networks (DNNs), which have propelled an evolution in the machine-learning field. DNNs share some architectural features of nervous systems, some of which are loosely inspired by biological vision systems [19]. DNNs dominate computer vision today and attract strongly growing interest for solving all kinds of classification, function-approximation, interpolation, and forecasting problems. Training DNNs is computationally intense. For example, Baidu Research estimated that training one DNN for speech recognition can require up to 20 exaflops (10^18 floating-point operations per second), whereas Summit, the world’s largest supercomputer in June 2019, delivers about 148 petaflops. Increasing the available computational resources enables more accurate models as well as newer models for high-value problems such as autonomous driving, and allows experimenting with more advanced uses of artificial intelligence (AI) for digital transformation. Corporate investment in artificial intelligence is expected to increase rapidly, becoming a $100 billion market by 2025 [20].

Hence, a variety of hardware and software solutions have emerged to slake the industry’s thirst for performance. The currently best-known commercial machines targeting deep learning are the TPUs of Google and the Nvidia Volta V100 and Turing GPUs. The Tensor Cores of the Volta V100, for example, multiply FP16 input data with FP32 accumulation. The FP16 multiply results in a full-precision product that is then accumulated using FP32 addition with the other intermediate products for a 4 × 4 × 4 matrix multiply [23]. The Nvidia DGX-1 system based on the Volta V100 GPUs was delivered in the third quarter of 2017 [24] as, at that time, the world’s first purpose-built system optimized for deep learning, with fully integrated hardware and software.

Many more options for DNN hardware acceleration are showing up [25]. AMD’s forthcoming Vega GPU should offer 13 TFLOPS of single-precision and 25 TFLOPS of half-precision performance, whereas the machine-learning accelerators in the Volta GPU-based Tesla V100 can offer 15 TFLOPS single precision (FP32) and 120 Tensor TFLOPS (FP16) for deep-learning workloads. Microsoft has been using Altera FPGAs for similar workloads, though a performance comparison is
A tensor processing unit (TPU) is an ASIC developed
tricky; the company has performed demonstrations
by Google specifically for machine learning. The chip
of more than 1 Exa-operations per second [26]. Intel
has been specifically designed for Google’s Tensor-
offers the Xeon Phi 7200 family and IBMs TrueNorth
Flow framework. The first generation of TPUs applied
tackles deep learning as well [27]. Other chip and IP
8-bit integer MAC (multiply accumulate) operations.
(Intellectual Property) vendors—including Cadence,
It is deployed in data centres since 2015 to acceler-
Ceva, Synopsys, and Qualcomms zeroth—are touting
ate the inference phase of DNNs. An in-depth analy-
DSPs for learning algorithms. Although these hard-
sis was recently published by Jouppi et al. [21]. The
ware designs are better than CPUs, none was origi-
second generation TPU of Google, announced in May
nally developed for DNNs. Ceva’s new XM6 DSP core6
2017, are rated at 45 TFLOPS and arranged into 4-chip
enables deep learning in embedded computer vision
180 TFLOPS modules. These modules are then assem-
(CV) processors. The synthesizable intellectual prop-
bled into 256 chip pods with 11.5 PFLOPS of perfor-
erty (IP) targets self-driving cars, augmented and vir-
mance [22]. The new TPUs are optimized for both
tual reality, surveillance cameras, drones, and robotics.
training and making inferences.
The normalization, pooling, and other layers that con-
Nvidia’s Tesla V100 GPU contains 640 Tensor Cores stitute a convolutional-neural-network model run on
delivering up to 120 Tensor TFLOPS for training and the XM6’s 512-bit vector processing units (VPUs). The
inference applications. Tensor Cores and their as- new design increases the number of VPUs from two
sociated data paths are custom-designed to dramat- to three, all of which share 128 single-cycle (16 × 16)-
ically increase floating-point compute throughput bit MACs, bringing the XM6’s total MAC count to 640.
with high energy efficiency. For deep learning in- The core also includes four 32-bit scalar processing
ference, V100 Tensor Cores provide up to 6x higher units.
peak TFLOPS compared to standard FP16 operations
on Nvidia Pascal P100, which already features 16-bit Examples for start-ups are Nervana Systems7 , Knu-
FP operations [23]. path8 , Wave Computing9 , and Cerebas10 . The Nervana
Engine will combine a custom 28 nm chip with 32 GB
Matrix-Matrix multiplication operations are at the
of high bandwidth memory and replacing caches with
core of DNN training and inferencing, and are used
software-managed memory. Kupath second gener-
to multiply large matrices of input data and weights
ation DSP Hermosa is positioned for deep learning
in the connected layers of the network. Each Tensor
Core operates on a 4 × 4 matrix and performs the fol- 6 www.ceva-dsp.com
lowing operation: D = A×B+C, where A, B, C, and 7 www.nervanasys.com
8
D are 4 × 4 matrices. Tensor Cores operate on FP16 9 www.knupath.com
www.wavecomp.com
5 10
www.baidu.com www.graphcore.ai
as well as signal processing. The 32 nm chip contains 256 tiny DSP cores operating at 1 GHz along with 64 DMA engines, and burns 34 W. The dataflow processing unit from Wave Computing implements “tens of thousands” of processing nodes and “massive amounts” of memory bandwidth to support TensorFlow and similar machine-learning frameworks. The design uses self-timed logic that reaches speeds of up to 10 GHz. The 16 nm chip contains 16 thousand independent processing elements that generate a total of 180 Tera 8-bit integer operations per second. The wafer-scale approach from Cerebras is another start-up example at the extreme end of the large spectrum of approaches. The company claims to have built the largest chip ever, with 1.2 trillion transistors on 46,225 mm² of silicon (TSMC 16nm process). It contains 400,000 optimized cores, 18 GB of on-chip memory, and 9 PByte/s of memory bandwidth. The programmable cores with local memory are optimized for machine-learning primitives and connected with high-bandwidth and low-latency connections [28].

Impact on Hardware for Neuromorphic and Neuro-Inspired Computing

Creating the architectural design for NMC requires an integrative, interdisciplinary approach between computer scientists, engineers, physicists, and materials scientists. NMC would be efficient in energy and space and applicable as an embedded hardware accelerator in mobile systems. The building blocks for ICs and for the Brain are the same at the nanoscale level: electrons, atoms, and molecules, but their evolutions have been radically different. The fact that reliability, low power, reconfigurability, as well as asynchronicity have been brought up so many times in recent conferences and articles makes it compelling that the Brain should be an inspiration at many different levels, suggesting that future nano-architectures could be neural-inspired. The fascination associated with an electronic replication of the human brain has grown with the persistent exponential progress of chip technology. The decade 2010–2020 has also made the electronic implementation more feasible, because electronic circuits now perform synaptic operations such as multiplication and signal communication at energy levels of 10 fJ, comparable to biological synapses. Nevertheless, an all-out assembly of 10^14 synapses will remain a matter of a few exploratory systems for the next two decades because of several challenges [6].

Neuromorphic hardware development is progressing fast, with a steady stream of new architectures coming up. Because network models and learning algorithms are still developing, there is little agreement on what a learning chip should actually look like. The companies withhold details on the internal architecture of their learning accelerators. Most of the designs appear to focus on high throughput for low-precision data, backed by high-bandwidth memory subsystems. The effect of low precision on the learning result has not been analysed in detail yet. Recent work on low-precision implementations of backprop-based neural nets [29] suggests that between 8 and 16 bits of precision can suffice for using or training DNNs with backpropagation. What is clear is that more precision is required during training than at inference time, and that some forms of dynamic fixed-point representation of numbers can be used to reduce how many bits are required per number. Using fixed-point rather than floating-point representations and using fewer bits per number reduces the hardware surface area, power requirements, and computing time needed for performing multiplications, and multiplications are the most demanding of the operations needed to use or train a modern deep network with backpropagation.

A first standardization effort is the specification of the Brain Floating Point (BFLOAT16) half-precision data format for DNN learning [30]. Its dynamic range is the same as that of FP32, conversion between both is straightforward, and training results are almost the same as with FP32. Industry-wide adoption of BFLOAT16 is expected.

Memristors in Neuromorphic and Neuro-Inspired Computing

In the long run, memristor technology is also heavily discussed in the literature for future neuromorphic computing. The idea, e.g. in so-called spike-time-dependent plasticity (STDP) networks [31, 32], is to directly mimic the functional behaviour of a neuron. In STDP networks the strength of a link to a cell is determined by the time correlation of incoming signals to a neuron along that link and the output spikes. The shorter the input pulses are compared to the output spike, the stronger the input links to the neuron are weighted. In contrast, the longer the input signals lag behind the output spike, the weaker the link is adjusted. This process of strengthening or weakening the weight shall be directly mapped onto memristors
by increasing or decreasing their resistance depending on which voltage polarity is applied to the poles of a two-terminal memristive device. This direct mapping of an STDP network onto an analogue equivalent of the biological cells, with artificial memristor-based neuron cells, shall yield new extremely low-energy neuromorphic circuits. Besides these memristor-based STDP networks, there are many proposals for neural networks realised with memristor-based crossbar and mesh architectures for cognitive detection and vision applications, e.g. [33].

One extremely useful property of memristors in the neuromorphic context is their biorealism, i.e., the ability to mimic the behavior of elements found in the human brain [34] and vision system [35]. Some of the early neuromorphic systems used capacitors to represent weights in the analog domain [1], and memristance can assume their role [34]. Well-known learning concepts, including spike-timing-dependent plasticity (STDP), can be mapped to memristive components in a natural way [36]. A recent example of a biorealistic hardware model is [37], which reports the manufacturing of a larger-scale network of artificial memristive neurons and synapses capable of learning. The memristive functionality is achieved by precisely controlling silver nanoparticles in a dielectric film such that their electrical properties closely match ion channels in a biological neuron.

Biorealistic models are not the only application of memristors in neuromorphic or neuro-inspired architectures. Such architectures realize neural networks (NNs) with a vast amount of weights, which are determined, or learned, during the training phase, and then used without modification for an extended period of time, during the inference phase. After some time, when the relevant conditions have changed, it may become necessary to re-train the NN and replace its weights by new values. Memristive NVMs are an attractive, lightweight and low-power option for storing these weights. The circuit, once trained, can be activated and deactivated flexibly while retaining its learned knowledge. A number of neuromorphic accelerators based on memristive NVMs have been proposed in the last few years. For example, IBM developed a neuromorphic core with a 64-K-PCM-cell “synaptic array” with 256 axons × 256 dendrites to implement spiking neural networks [38].

Perspectives on Neuromorphic and Neuro-Inspired Computing

Brain-inspired hardware computing architectures have the potential to perform AI tasks better than conventional architectures by means of better performance, lower energy consumption, and higher resilience to defects. Neuromorphic Computing and Deep Neural Networks represent two approaches for taking inspiration from biological brains. Software implementations on HPC clusters, multi-cores (OpenCV), and GPGPUs (NVidia cuDNN) are already commercially used. FPGA acceleration of neural networks is available as well. From a short-term perspective, these software-implemented ANNs may be accelerated by commercial transistor-based neuromorphic chips or accelerators. Future emerging hardware technologies like memcomputing and 3D stacking [39] may bring neuromorphic computing to a new level and overcome some of the restrictions of Von-Neumann-based systems in terms of scalability, power consumption, or performance.

Particularly attractive is the application of ANNs in those domains where, at present, humans outperform any currently available high-performance computer, e.g., in areas like vision, auditory perception, or sensory-motor control. Neural information processing is expected to have a wide applicability in areas that require a high degree of flexibility and the ability to operate in uncertain environments where information usually is partial, fuzzy, or even contradictory. This technology is not only offering potential for large-scale neuroscience applications, but also for embedded ones: robotics, automotive, smartphones, IoT, surveillance, and other areas [6]. Neuromorphic computing appears as a key technology on several emerging-technology lists. Hence, neuromorphic technology developments are considered a powerful solution for future advanced computing systems [40]. Neuromorphic technology is in its early stages, despite quite a number of applications appearing.

To gain leadership in this domain there are still many important open questions that need urgent investigation (e.g. scalable resource-efficient implementations, online learning, and interpretability). There is a need to continue to mature NMC systems and at the same time to demonstrate the usefulness of the systems in applications, for industry and also for society: more usability and demonstrated applications.

More focus on technology access might be needed in
Europe. Regarding difficulties for NMC in EC framework programmes, integrated projects fitted the needs of NMC well in FP7, but are missing in H2020. For further research on neuromorphic technology the FET-OPEN scheme could be a good path, as it requires several disciplines (computer science, materials science, and engineering, in addition to neuroscience and modelling). One also needs support for many small-scale interdisciplinary exploratory projects to take advantage of newly emerging developments, and to allow funding a new generation of developers with new ideas.

References

[1] C. Mead. “Neuromorphic Electronic Systems”. In: Proceedings of the IEEE 78.10 (Oct. 1990), pp. 1629–1636. doi: 10.1109/5.58356.
[2] W. Wen, C. Wu, X. Hu, B. Liu, T. Ho, X. Li, and Y. Chen. “An EDA framework for large scale hybrid neuromorphic computing systems”. In: 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC). June 2015, pp. 1–6.
[3] Y. V. Pershin and M. D. Ventra. “Neuromorphic, Digital, and Quantum Computation With Memory Circuit Elements”. In: Proceedings of the IEEE 100.6 (June 2012), pp. 2071–2080. doi: 10.1109/JPROC.2011.2166369.
[4] M. D. Pickett, G. Medeiros-Ribeiro, and R. S. Williams. “A scalable neuristor built with Mott memristors”. In: Nature Materials 12 (Feb. 2013), pp. 114–117. doi: 10.1038/nmat3510.
[5] S. H. Jo, T. Chang, I. Ebong, B. B. Bhadviya, P. Mazumder, and W. Lu. “Nanoscale Memristor Device as Synapse in Neuromorphic Systems”. In: Nano Letters 10.4 (2010), pp. 1297–1301. doi: 10.1021/nl904092h.
[6] U. Rueckert. “Brain-Inspired Architectures for Nanoelectronics”. In: CHIPS 2020 VOL. 2: New Vistas in Nanoelectronics. 2016, pp. 249–274. doi: 10.1007/978-3-319-22093-2_18.
[7] E. Eleftheriou. “Future Non-Volatile Memories: Technology, Trends, and Applications”. 2015.
[8] U.S. Brain Initiative. 2019. url: http://braininitiative.nih.gov.
[9] Human Brain Project. 2017. url: http://www.humanbrainproject.eu.
[10] M. Poo et al. “China Brain Project: Basic Neuroscience, Brain Diseases, and Brain-Inspired Computing”. In: Neuron 92 (Nov. 2016), pp. 591–596.
[11] Japanese Brain/MIND Project. 2019. url: http://brainminds.jp/en/.
[12] The Blue Brain Project - A Swiss Brain Initiative. 2017. url: http://bluebrain.epfl.ch/page-56882-en.html.
[13] S. Aamir et al. “An Accelerated LIF Neuronal Network Array for a Large Scale Mixed-Signal Neuromorphic Architecture”. In: arXiv:1804.01906v3 (2018).
[14] Y. Yan et al. “Efficient Reward-Based Structural Plasticity on a SpiNNaker 2 Prototype”. In: IEEE Trans. on Biomedical Circuits and Systems 13.3 (2019), pp. 579–591.
[15] J. Park et al. “A 65nm 236.5nJ/Classification Neuromorphic Processor with 7.5% Energy Overhead On-Chip Learning Using Direct Spike-Only Feedback”. In: IEEE Int. Solid-State Circuits Conference. 2019, pp. 140–141.
[16] J. Pei et al. “Towards artificial general intelligence with hybrid Tianjic chip architecture”. In: Nature 572 (2019), p. 106.
[17] P. A. Merolla et al. “A million spiking-neuron integrated circuit with a scalable communication network and interface”. In: Science 345.6197 (2014), pp. 668–673. doi: 10.1126/science.1254642.
[18] M. Davies et al. “A Neuromorphic Manycore Processor with On-Chip Learning”. In: IEEE Micro 1 (2018), pp. 82–99.
[19] Y. LeCun and Y. Bengio. In: The Handbook of Brain Theory and Neural Networks. 1998. Chap. Convolutional Networks for Images, Speech, and Time Series, pp. 255–258.
[20] D. Wellers, T. Elliott, and M. Noga. 8 Ways Machine Learning Is Improving Companies’ Work Processes. Harvard Business Review. 2017. url: https://hbr.org/2017/05/8-ways-machine-learning-is-improving-companies-work-processes.
[21] N. P. Jouppi et al. “In-Datacenter Performance Analysis of a Tensor Processing Unit”. In: Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA). ISCA ’17. 2017, pp. 1–12. doi: 10.1145/3079856.3080246.
[22] Wikipedia. Tensor Processing Unit. 2017. url: https://en.wikipedia.org/wiki/Tensor_processing_unit.
[23] Nvidia Corporation. Nvidia Tesla V100 GPU Architecture. Version WP-08608-001_v01. 2017. url: https://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf.
[24] T. P. Morgan. Big Bang for the Buck Jump with Volta DGX-1. 2017. url: https://www.nextplatform.com/2017/05/19/big-bang-buck-jump-new-dgx-1/.
[25] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam. “DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning”. In: Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ASPLOS ’14. 2014, pp. 269–284. doi: 10.1145/2541940.2541967.
[26] P. Bright. Google brings 45 Teraflops Tensor Flow Processors to its Compute Cloud. 2017. url: https://arstechnica.com/information-technology/2017/05/google-brings-45-teraflops-tensor-flow-processors-to-its-compute-cloud/.
[27] L. Gwennap. “Learning Chips Hit The Market: New Architectures Designed for Neural Processing”. In: Microprocessor Report. MPR 6/27/16 (2016).
[28] M. Demler. “Cerebras Breaks the Reticle Barrier: Wafer-Scale Engineering Enables Integration of 1.2 Trillion Transistors”. In: MPR (2019).
[29] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. “Deep Learning with Limited Numerical Precision”. In: Proceedings of the 32nd International Conference on Machine Learning (ICML). ICML’15. 2015, pp. 1737–1746.
[30] D. Kalamkar et al. “A Study of BFLOAT16 for Deep Learning Training”. In: arXiv:1905.12322v3 (2019).
[31] G. S. Snider. “Spike-timing-dependent learning in memristive nanodevices”. In: International Symposium on Nanoscale Architectures. 2008, pp. 85–92. doi: 10.1109/NANOARCH.2008.4585796.
[32] T. Serrano-Gotarredona, T. Masquelier, T. Prodromakis, G. Indiveri, and B. Linares-Barranco. “STDP and STDP variations with memristors for spiking neuromorphic learning systems”. In: Frontiers in Neuroscience 7 (2013), p. 15. doi: 10.3389/fnins.2013.00002.
[33] C. K. K. Lim, A. Gelencser, and T. Prodromakis. “Computing Image and Motion with 3-D Memristive Grids”. In: Memristor Networks. 2014, pp. 553–583. doi: 10.1007/978-3-319-02630-5_25.
[34] I. E. Ebong and P. Mazumder. “CMOS and Memristor-Based Neural Network Design for Position Detection”. In: Proceedings of the IEEE 100.6 (2012), pp. 2050–2060.
[35] C. K. K. Lim, A. Gelencser, and T. Prodromakis. “Computing Image and Motion with 3-D Memristive Grids”. In: Memristor Networks. 2014, pp. 553–583. doi: 10.1007/978-3-319-02630-5_25.
[36] T. Serrano-Gotarredona, T. Masquelier, T. Prodromakis, G. Indiveri, and B. Linares-Barranco. “STDP and STDP variations with memristors for spiking neuromorphic learning systems”. In: Frontiers in Neuroscience 7 (2013).
[37] X. Wang, S. Joshi, S. Savelév, et al. “Fully memristive neural networks for pattern classification with unsupervised learning”. In: Nature Electronics 1.2 (2018), pp. 137–145.
[38] G. Hilson. IBM Tackles Phase-Change Memory Drift, Resistance. EETimes. 2015. url: http://www.eetimes.com/document.asp?doc_id=1326477.
[39] B. Belhadj, A. Valentian, P. Vivet, M. Duranton, L. He, and O. Temam. “The improbable but highly appropriate marriage of 3D stacking and neuromorphic accelerators”. In: International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES). 2014, pp. 1–9. doi: 10.1145/2656106.2656130.
[40] FET. Workshop on the Exploitation of Neuromorphic Computing Technologies. 2017. url: https://ec.europa.eu/digital-single-market/en/news/workshop-exploitation-neuromorphic-computing-technologies.
3.3 Applying Memristor Technology in Reconfigurable Hardware

Reconfigurable computing combines the advantages of the programmability of software with the performance of hardware. Industry and research exploit this ability for fast prototyping of hardware, updating hardware in the field, or reducing costs in environments where a company only requires a small volume of chips. Even in High-Performance Computing (HPC), reconfigurable hardware plays an important role by accelerating time-consuming functions. Reconfigurable hardware is well integrated in modern computational environments, for example in Systems-on-Chip (SoCs) or on additional accelerator cards. The most common chip types used for reconfigurable hardware are Field-Programmable Gate Arrays (FPGAs). Their importance has increased in recent years because FPGA vendors like Xilinx and Intel switched to much smaller chip-fabrication processes and could double the size of the available reconfigurable hardware per chip.

At the moment reconfigurable hardware is produced in a standard CMOS fabrication process. Configuration memory, Block RAM, and Look-Up Tables (LUTs) are implemented using Static Random-Access Memory (SRAM) cells or flash-based memory. Crossbar switches consisting of multiple transistors provide the routing and communication infrastructure.

CMOS compatibility and a small area and power footprint are the important features of memristor technology for reconfigurable hardware. At the moment the main challenges for reconfigurable hardware are a high static power consumption and long interconnection delays. Memristor technology, applied to important building blocks of reconfigurable hardware, can help overcome these challenges.

The following subsections describe the impact of memristor technology on key parts of reconfigurable hardware.

Memristors in Block RAM

Block RAM is the most obvious part of reconfigurable hardware for the deployment of memristor technology. Current Block RAM is SRAM based, and one SRAM cell consists of six CMOS transistors.

The 1T1R11 memristor technique introduced by Tanachutiwat et al. [1] reduces the number of transistors required for one memory cell to one. Memristor-based memory cells require a small encode/decode hardware, but this technique still achieves a six-fold area-density improvement over the SRAM-based approach. The memristor-based cells only require power if their content changes, reducing the static power consumption of reconfigurable hardware. Because of the density improvements, even more Block RAM can be deployed on the reconfigurable hardware than is currently available. Another important improvement offered by memristor technology is its non-volatility. Even if the whole reconfigurable hardware loses power, the content of the Block RAM is still available after power restoration.

11 1 Transistor Element and 1 Resistive Element

Memristors in Configurable Logic Blocks (CLBs)

The CLBs are another important building block of reconfigurable hardware because they implement the different hardware functions. In general this is achieved by using and combining LUTs and/or multiplexers. Like Block RAM, LUTs are, at the moment, based on SRAM cells. The 1T1R approach of Section 3.3 is also a simple approach to improve area density and power consumption within LUTs (see, for example, Kumar [2]). The non-volatile feature of memristors would improve the configuration management of reconfigurable hardware because the configuration of the hardware does not need to be reloaded after a power loss.

Memristors in the Interconnection Network

The interconnection network of reconfigurable hardware is responsible for 50%-90% of the total reconfigurable hardware area usage, 70%-80% of the total signal delay and 60%-85% of the total power consumption [3]. Improving the interconnection network will have a huge impact on the overall reconfigurable hardware performance. Routing resources of the interconnection network are implemented using seven CMOS transistors at the moment: six transistors for an SRAM cell and one transistor for controlling the path.

Tanachutiwat et al. [1] extend their 1T1R approach for Block RAM cells to a 2T1R and a 2T2R technique for routing switches. The latter is fully compatible with the current implementation because one transistor
controls the path, while in the 2T1R technique a memristor does. The 2T1R and 2T2R approach is also used by Hasan et al. [4] to build complex crossbar switches. A complex routing switch built out of many 2T1R or 2T2R elements can save even more transistors by combining different programming transistors.

The memristor-based improvements for the interconnection network reduce the standby power of reconfigurable hardware considerably. They also reduce the area requirements for the interconnection network, allowing a denser placement of the logic blocks and, therefore, improving the overall signal delay. As in the previous sections, the non-volatile nature of the memristors prevents configuration loss on power disconnect.

Conclusion and Research Perspective

References

[1] S. Tanachutiwat, M. Liu, and W. Wang. “FPGA Based on Integration of CMOS and RRAM”. In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems 19.11 (2011), pp. 2023–2032.
[2] T. Nandha Kumar. “An Overview on Memristor-Based Nonvolatile LUT of an FPGA”. In: Frontiers in Electronic Technologies: Trends and Challenges. Singapore, 2017, pp. 117–132. doi: 10.1007/978-981-10-4235-5_8.
[3] J. Cong and B. Xiao. “FPGA-RPI: A Novel FPGA Architecture With RRAM-Based Programmable Interconnects”. In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems 22.4 (Apr. 2014), pp. 864–877. doi: 10.1109/TVLSI.2013.2259512.
[4] R. Hasan and T. M. Taha. “Memristor Crossbar Based Programmable Interconnects”. In: 2014 IEEE Computer Society Annual Symposium on VLSI. July 2014, pp. 94–99. doi: 10.1109/ISVLSI.2014.100.
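The routing-resource figures quoted in Section 3.3 (a six-transistor SRAM cell plus one pass transistor per routing switch, versus two transistors and a memristor in the 2T1R technique) invite a quick back-of-envelope estimate. The sketch below is illustrative only: the one-million-switch figure is an assumption, and it deliberately ignores the shared encode/decode logic and the memristor's own layout area, so it is not a number taken from [1].

```python
# Back-of-envelope transistor-count comparison for FPGA routing switches,
# using the per-switch figures quoted in Section 3.3. Simplified sketch:
# encode/decode overhead and memristor footprint are ignored.

SRAM_SWITCH_TRANSISTORS = 6 + 1  # 6T SRAM cell + 1 pass transistor
MEMRISTOR_2T1R_TRANSISTORS = 2   # 2T1R switch: the memristor controls the path

def transistors_saved(num_switches: int) -> int:
    """Transistors saved by replacing SRAM-based routing switches with 2T1R ones."""
    return num_switches * (SRAM_SWITCH_TRANSISTORS - MEMRISTOR_2T1R_TRANSISTORS)

# A hypothetical mid-size FPGA with one million routing switches:
print(transistors_saved(1_000_000))  # -> 5000000
```

Even under these crude assumptions, five transistors saved per switch across millions of switches illustrates why the interconnection network, which dominates FPGA area and power, is the most rewarding target for memristor-based cells.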
3.4 Non-Silicon-based Technology Manufacturing of passive optical modules (e.g. waveg-
uides, splitters, crossings, microrings) is relatively
compatible with CMOS process and the typical cross-
3.4.1 Photonics
section of a waveguide (about 500 nm) is not critical,
unless for the smoothness of the waveguide walls as to
The general idea of using photonics in computing sys-
keep light scattering small. Turns with curvature of a
tems is to replace electrons with photons in intra-
few µm and exposing limited insertion loss are possi-
chip, inter-chip, processor-to-memory connections
ble, as well as grating couplers to introduce/emit light
and maybe even logic.
from/into a fiber outside of the chip. Even various 5x5
optical switches [7] can be manufactured out of ba-
Introduction to Photonics and Integrated Photonics

An optical transmission link is composed of a few key modules: a laser light source, a modulator that converts electronic signals into optical ones, waveguides and other passive modules (e.g. couplers, photonic switching elements, splitters) along the link, possibly a drop filter to steer light towards the destination, and a photodetector to convert the signal back into the electronic domain. The term integrated photonics refers to a photonic interconnection where at least some of the involved modules are integrated into silicon [1]. Directly modulated integrated laser sources have also been developed and are improving at a steady pace [2, 3, 4, 5]. Active components (lasers, modulators and photodetectors) cannot be trivially implemented in a CMOS process, as they require materials (e.g. III-V semiconductors) different from silicon and typically not naturally compatible with it in the production process. However, great improvements have been made on this subject in recent years.

Optical communication nowadays features modulation frequencies of about 10-50 GHz and can support wavelength-division multiplexing (WDM) with 100+ colors in fiber and 10+ in silicon (more are expected in the near future). Advanced modulations are also being investigated to push the bandwidth per laser color into the 100s of Gbps, as in [6]. Propagation loss is relatively small in silicon and polymer materials, so optical communication can be regarded as substantially insensitive to chip- and board-level distances; where fiber can be employed (e.g. at rack and data-centre level), attenuation is not a problem. Optical communication can also rely on extremely fast signal propagation (head-flit latency): around 15 ps/mm in silicon and about 5.2 ps/mm in polymer waveguides, i.e. a 2 cm x 2 cm chip is traversed corner-to-corner in 0.6 and 0.2 ns, respectively. However, conversions to/from the optical domain can erode some of this intrinsic low latency, as can network-level protocols and shared-resource management.

Basic photonic switching elements rely on tunable micro-ring resonators. Combining these optical modules, various optical interconnection topologies and schemes can be devised: from all-to-all contention-less networks up to arbitrated ones which share optical resources among different possible paths.

In practice, WDM requires precision in microring manufacturing, runtime tuning (e.g. thermal) and alignment (multiple microrings with the same resonant frequency), and it complicates the management of multi-wavelength light, from generation, distribution, modulation and steering up to photo-detection. The more complex a topology, the more modules can be found along the possible paths between source and destination, on- and off-chip, and the more laser power is needed to compensate their attenuation and meet the sensitivity of the detector. For these reasons, relatively simple topologies can be preferable to limit power consumption, and spatial division multiplexing (using multiple parallel waveguides) allows WDM to be traded for space occupation.

Optical inter-chip signals are also expected to be conveyed on different mediums to facilitate integrability with the CMOS process, e.g. polycarbonate, as in some IBM research prototypes and commercial solutions.

Current Status and Roadmaps

Currently, optical communication is mainly used in HPC systems in the form of optical cables, which have progressively substituted shorter and shorter electronic links: from 10+ m inter-rack communication down to 1+ m intra-rack and sub-meter intra-blade links.

A number of industrial and research roadmaps project this trend to arrive within boards, and then to have optical technology that crosses the chip boundary and connects chips.
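The propagation-latency figures quoted above can be sanity-checked with a short script. This is a back-of-the-envelope sketch of our own: the 4 cm Manhattan route for traversing a 2 cm x 2 cm chip corner-to-corner is our assumption, not a figure from the roadmap.

```python
# Sanity check of the head-flit propagation latencies quoted above.
# Assumption (ours): "corner-to-corner" on a 2 cm x 2 cm chip is a
# 4 cm Manhattan path (2 cm horizontally + 2 cm vertically).

def propagation_latency_ns(path_mm: float, ps_per_mm: float) -> float:
    """Pure propagation latency in nanoseconds for a waveguide path."""
    return path_mm * ps_per_mm / 1000.0

PATH_MM = 20.0 + 20.0  # 2 cm + 2 cm Manhattan route, in mm

silicon_ns = propagation_latency_ns(PATH_MM, 15.0)  # silicon: 15 ps/mm
polymer_ns = propagation_latency_ns(PATH_MM, 5.2)   # polymer: 5.2 ps/mm

print(f"silicon: {silicon_ns:.2f} ns, polymer: {polymer_ns:.2f} ns")
# -> silicon: 0.60 ns, polymer: 0.21 ns
```

The numbers match the 0.6 ns and roughly 0.2 ns traversal times cited in the text; any conversion to/from the electronic domain comes on top of this pure propagation delay.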
Technology 51
Table 3.2: Expected evolution of optical interconnection [8].

Time Frame                    ~2000        ~2005        ~2010         ~2015       ~2020      ~2025
Interconnect                  Rack         Chassis      Backplane     Board       Module     Chip
Reach                         20 – 100 m   2 – 4 m      1 – 2 m       0.1 – 1 m   1 – 10 cm  0.1 – 3 cm
Bandwidth                     40 – 200 Gb/s 20 – 100 Gb/s 100 – 400 Gb/s 0.3 – 1 Tb/s 1 – 4 Tb/s 2 – 20 Tb/s
Bandwidth Density (GB/s/cm²)  ~100         ~100 – 400   ~400          ~1250       > 10000    > 40000
Energy (pJ/bit)               1000 → 200   400 → 50     100 → 25      25 → 5      1 → 0.1    0.1 → 0.01
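To make Table 3.2's metrics concrete, the following toy model (our own simplification, not part of the roadmap) relates aggregate link bandwidth, obtained by combining WDM colors with parallel waveguides, to link power at a given energy-per-bit figure.

```python
# Toy model (our simplification) relating Table 3.2's metrics: aggregate
# bandwidth from WDM colors and parallel waveguides, and the resulting
# link power at a given energy-per-bit figure.

def aggregate_bandwidth_gbps(colors: int, gbps_per_color: float,
                             waveguides: int) -> float:
    """Aggregate link bandwidth in Gb/s."""
    return colors * gbps_per_color * waveguides

def link_power_watts(bandwidth_gbps: float, pj_per_bit: float) -> float:
    """Power needed to sustain bandwidth_gbps at pj_per_bit."""
    # 1 Gb/s at 1 pJ/bit = 1e9 bit/s * 1e-12 J/bit = 1 mW
    return bandwidth_gbps * pj_per_bit * 1e-3

# Example: 10 colors x 25 Gb/s x 4 waveguides = 1 Tb/s (the order of the
# ~2020 "Module" column); at 0.1 pJ/bit this costs 0.1 W, versus 25 W at
# the 25 pJ/bit typical of earlier board-level links.
bw = aggregate_bandwidth_gbps(colors=10, gbps_per_color=25.0, waveguides=4)
print(bw, link_power_watts(bw, 0.1), link_power_watts(bw, 25.0))
```

The example shows why the energy-per-bit column of Table 3.2 dominates the feasibility of Tb/s-class chip-level links, and how spatial division multiplexing (the `waveguides` factor) can trade off against WDM color count.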
Optical technology is then expected to connect chips within silicon and then optical interposers, eventually arriving at a complete integration of optics on a different layer of traditional chips. For this reason, the evolution of 2.5D/3D stacking technologies is also expected to enable and sustain this roadmap, up to the seamless integration of optical layers along with logic ones and the dawn of disaggregated architectures enabled by the low-latency features of optics [9]. The expected rated performance/consumption/density metrics are shown in the 2016 Integrated Photonic Systems Roadmap [8] (see Table 3.2).

IBM, HPE, Intel, STM, CEA-LETI, Imec and Petra, to cite a few, essentially share a similar view on this roadmap and on the steps needed to increase the bandwidth density, power consumption and cost effectiveness of the interconnections needed in Exascale and post-Exascale HPC systems. For instance, Petra labs demonstrated the first optical silicon interposer prototype [10] in 2013, featuring 30 TB/s/cm2 bandwidth density, and in 2016 improved the consumption and high-temperature operation of the optical modules [11]. HP has announced The Machine system, which relies on the optical X1 photonic module capable of 1.5 Tbps over 50 m and 0.25 Tbps over 50 km. Intel has announced the Omni-Path Interconnect Architecture, which will provide a migration path between Cu and fiber for future HPC/data-centre interconnections; optical Thunderbolt and optical PCI Express by Intel are other examples of optical cable solutions. IBM has been shipping polymer + micro-pod optical interconnections within HPC blades since 2012 and is moving towards module-to-module integration.

The main indications from current roadmaps and trends can be summarized as follows. Optical cables (AOC, Active Optical Cables) are evolving in capability (bandwidth, integration and consumption) and are getting closer to the chips, leveraging more and more photonics in an integrated form. The packaging problem of photonics remains a major issue, especially where optical signals need to traverse the chip package. Also for these reasons, interposers (silicon and optical) appear to be the reasonable first steps towards optically integrated chips; then, full 3D processing and hybrid material integration are expected from the process point of view.

Figure 3.6 shows the expected adoption roadmap of the different optical technologies, published in the 2017 Integrated Photonic Systems Roadmap, in particular from the interconnect, packaging and photonic integration standpoints. The expected evolution of laser sources over time is confirmed, as is the expectation that interposer-based solutions will pave the way to fully integrated ones.

Conversion from photons to electrons is costly, and for this reason there are currently strong efforts to improve the crucial physical modules of an integrated optical channel (e.g. modulators, photodetectors, and thermally stable and efficiently integrated laser sources).

Alternate and Emerging Technologies Around Photonics

Photonics is in considerable evolution, driven by innovations in existing components (e.g. lasers, modulators and photodetectors) that push their features and applicability (e.g. high-temperature lasers). Consequently, its expected potential is a moving target based on the progress in the rated features of the various modules. At the same time, some additional variations, techniques and approaches at the physical
Figure 3.6: Integrated Photonic Systems Roadmap, 2017. Adoption expectations of the different optical
technologies.
level of the photonic domain are being investigated and could potentially create further discontinuities and opportunities in the adoption of photonics in computing systems. For instance, we cite here a few:

• Mode division multiplexing [12]: light propagates within a group of waveguides in parallel. This poses some criticalities but could allow parallelism to be scaled more easily than with WDM and/or be an orthogonal source of optical bandwidth;

• Free-air propagation: there are proposals to exploit light propagation within the chip package, without waveguides, to efficiently support some interesting communication patterns (e.g. fast signaling) [13];

• Optical computing: the Optalysys project computes in the optical domain, mapping information onto light properties and elaborating it directly in optics in an extremely energy-efficient way compared to traditional computers [18]. This approach cannot suit every application, but a number of algorithms, like linear and convolution-like computations (e.g. FFT, derivatives and correlation pattern matching), are naturally compatible [19]. Furthermore, bioinformatics sequence-alignment algorithms have also recently been demonstrated to be feasible. Optalysys has recently announced a commercial processor programmable either through a specific API or via a TensorFlow interface to implement convolutional neural networks (CNNs) [20].
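The natural fit between optical computing and convolution-like workloads rests on the convolution theorem: a lens performs a Fourier transform essentially for free, so a convolution becomes an element-wise product in the Fourier domain. A minimal numerical sketch of that identity (ours, using a deliberately simple O(n^2) DFT for clarity; an optical processor realizes the transform physically):

```python
# Convolution theorem: circular convolution in the signal domain equals
# element-wise multiplication in the Fourier domain. Pure-Python DFT,
# O(n^2), written for readability rather than speed.
import cmath

def dft(x):
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * i * k / n) for k in range(n))
            for i in range(n)]

def idft(x):
    n = len(x)
    return [sum(x[k] * cmath.exp(2j * cmath.pi * i * k / n) for k in range(n)) / n
            for i in range(n)]

def circular_convolution_fourier(x, h):
    """Circular convolution of x and h computed in the Fourier domain."""
    product = [a * b for a, b in zip(dft(x), dft(h))]
    return [c.real for c in idft(product)]

def circular_convolution_direct(x, h):
    """Reference: circular convolution computed directly."""
    n = len(x)
    return [sum(x[k] * h[(i - k) % n] for k in range(n)) for i in range(n)]

x = [1.0, 2.0, 3.0, 4.0]
h = [0.5, 0.5, 0.0, 0.0]
assert all(abs(a - b) < 1e-9
           for a, b in zip(circular_convolution_fourier(x, h),
                           circular_convolution_direct(x, h)))
```

The same identity underlies FFT-based correlation pattern matching and the convolutional layers that Optalysys targets with its CNN interface.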
…individual core needs so that "average" behaviours emerge.

For instance, inter-socket or intra-processor coherence and synchronization have been designed and tuned over decades for electronic technology and may need to be optimized, or re-thought, to take maximum advantage of the emerging photonic technology.

Research groups and companies are progressing towards inter-chip interposer solutions and completely optical chips. In this direction, researchers have already identified the crucial importance of a vertical cross-layer design of a computer system endowed with integrated photonics. A number of studies have already proposed various kinds of on-chip and inter-chip optical networks designed around the specific traffic patterns of the cores and processing chips [21, 22, 23, 24, 25, 26, 27].

These studies also suggest that further challenges will arise from inter-layer design interference, i.e. lower-layer design choices (e.g. WDM, physical topology, access strategies, sharing of resources) can have a significant impact on higher layers of the design (e.g. NoC-wise, up to memory-coherence and programming-model implications) and vice versa. This is mainly due to the scarce experience in using photonic technology to serve computing needs (close to processing-core requirements) and, most of all, to the intrinsic end-to-end nature of an efficient optical channel, which is conceptually opposed to the well-established and mature know-how of the "store-and-forward" electronic communication paradigm. Furthermore, the quick evolution of optical modules and the arrival of discontinuities in their development hamper the consolidation of layered design practices.

Lastly, the intrinsic low-latency properties of optical interconnection (on-chip and inter-chip) could imply a redefinition of what is local in a future computing system, at various scales, and specifically in a prospective HPC system, as has already partially happened within the HP Machine. These revised locality features will then require modifications in programming paradigms to enable them to take advantage of the different organization of future HPC machines. On this point, resource disaggregation is also regarded as another dimension that could soon be added to the design of future systems and, in particular, HPC systems [9, 15]. Then, from another perspective, if other emerging technologies (e.g. NVM, in-memory computation, approximate computing, quantum computing, etc.) appear in future HPC designs, as is expected in order to meet performance/watt objectives, it is highly likely that, for the reasons above, photonic interconnections will need to be co-designed in integration with the whole heterogeneous HPC architecture.

Funding Opportunities

Photonic technology at the physical and module level has been quite well funded in the H2020 programme [28], as it has been regarded as strategic by the EU for years. For instance, the Photonics21 [29] initiative gathers groups and researchers from a number of enabling disciplines to foster the wider adoption of photonics in general, and of integrated photonics specifically. Very recently, in September 2019, Photonics21 announced a request to the EU to double the budget from €100 million to €200 million per year, or €1.4 billion over the course of the next research funding initiative. Typically, funding instruments and calls focus on basic technologies and specific modules, and in some cases on point-to-point links as a final objective (e.g. optical cables).

Conversely, as photonics comes closer to the processing cores, which expose quite different traffic behaviour and communication requirements compared to larger-scale interconnections (e.g. inter-rack or wide-area), it is highly advisable to also promote a separate funding programme to investigate the specific issues and solutions for the adoption of integrated photonics at the inter-chip and intra-chip scale, in order to expose photonic technologies to the constraints coming from the actual traffic generated by the processing cores and other on-chip architectural modules. In fact, the market is approaching the cores from the outside with an optical-cable model that will be less and less suitable to serve the traffic as the communication distance decreases. Therefore, now could be just the right time to invest in chip-to-chip and intra-chip optical network research, in order to be prepared to apply it effectively when current roadmaps expect optics to arrive there.

References

[1] IBM. Silicon Integrated Nanophotonics. 2017. url: http://researcher.watson.ibm.com/researcher/view_group.php?id=2757.
[2] Y. Kurosaka, K. Hirose, T. Sugiyama, Y. Takiguchi, and Y. Nomoto. "Phase-modulating lasers toward on-chip integration". In: Scientific Reports 6.30138 (July 2016). doi: 10.1038/srep30138.

[3] H. Nishi et al. "Monolithic Integration of an 8-channel Directly Modulated Membrane-laser Array and a SiN AWG Filter on Si". In: Optical Fiber Communication Conference. 2018, Th3B.2. doi: 10.1364/OFC.2018.Th3B.2. url: http://www.osapublishing.org/abstract.cfm?URI=OFC-2018-Th3B.2.

[4] Z. Li, D. Lu, Y. He, F. Meng, X. Zhou, and J. Pan. "InP-based directly modulated monolithic integrated few-mode transmitter". In: Photon. Res. 6.5 (May 2018), pp. 463–467. doi: 10.1364/PRJ.6.000463. url: http://www.osapublishing.org/prj/abstract.cfm?URI=prj-6-5-463.

[5] S. Matsuo and K. Takeda. "λ-Scale Embedded Active Region Photonic Crystal (LEAP) Lasers for Optical Interconnects". In: Photonics 6 (July 2019), p. 82. doi: 10.3390/photonics6030082.

[6] C. Prodaniuc, N. Stojanovic, C. Xie, Z. Liang, J. Wei, and R. Llorente. "3-Dimensional PAM-8 modulation for 200 Gbps/lambda optical systems". In: Optics Communications 435 (2019), pp. 1–4. doi: 10.1016/j.optcom.2018.10.046. url: http://www.sciencedirect.com/science/article/pii/S0030401818309210.

[7] H. Gu, K. H. Mo, J. Xu, and W. Zhang. "A Low-power Low-cost Optical Router for Optical Networks-on-Chip in Multiprocessor Systems-on-Chip". In: IEEE Computer Society Annual Symposium on VLSI. May 2009, pp. 19–24. doi: 10.1109/ISVLSI.2009.19.

[8] 2016 Integrated Photonic Systems Roadmap. 2017. url: http://photonicsmanufacturing.org/2016-integrated-photonic-systems-roadmap.

[9] K. Bergman. "Flexibly Scalable High Performance Architectures with Embedded Photonics - Keynote". In: Platform for Advanced Scientific Computing (PASC) Conference. 2019.

[10] Y. Urino, T. Usuki, J. Fujikata, M. Ishizaka, K. Yamada, T. Horikawa, T. Nakamura, and Y. Arakawa. "Fully Integrated Silicon Optical Interposers with High Bandwidth Density". In: Advanced Photonics for Communications. 2014. doi: 10.1364/IPRSN.2014.IM2A.5.

[11] Y. Urino, T. Nakamura, and Y. Arakawa. "Silicon Optical Interposers for High-Density Optical Interconnects". In: Silicon Photonics III: Systems and Applications. 2016, pp. 1–39. doi: 10.1007/978-3-642-10503-6_1.

[12] H. Huang et al. "Mode Division Multiplexing Using an Orbital Angular Momentum Mode Sorter and MIMO-DSP Over a Graded-Index Few-Mode Optical Fibre". In: Scientific Reports 5 (2015). doi: 10.1038/srep14931.

[13] A. Malik and P. Singh. "Free Space Optics: Current Applications and Future Challenges". In: International Journal of Optics 2015 (2015).

[14] L. Gao, Y. Huo, K. Zang, S. Paik, Y. Chen, J. S. Harris, and Z. Zhou. "On-chip plasmonic waveguide optical waveplate". In: Scientific Reports 5 (2015).

[15] N. Pleros. "Silicon Photonics and Plasmonics towards Network-on-Chip Functionalities for Disaggregated Computing". In: Optical Fiber Communication Conference. 2018, Tu3F.4. doi: 10.1364/OFC.2018.Tu3F.4. url: http://www.osapublishing.org/abstract.cfm?URI=OFC-2018-Tu3F.4.

[16] K. L. Tsakmakidis et al. "Breaking Lorentz Reciprocity to Overcome the Time-Bandwidth Limit in Physics and Engineering". In: Science 356.6344 (2017), pp. 1260–1264. doi: 10.1126/science.aam6662.

[17] C. Ríos, M. Stegmaier, P. Hosseini, D. Wang, T. Scherer, C. D. Wright, H. Bhaskaran, and W. H. P. Pernice. "Integrated all-photonic non-volatile multi-level memory". In: Nature Photonics 9.11 (2015), pp. 725–732. doi: 10.1038/nphoton.2015.182.

[18] Wikipedia. Optical Computing. 2017. url: http://en.wikipedia.org/wiki/Optical_computing.

[19] Optalysys. Optalysys Prototype Proves Optical Processing Technology Will Revolutionise Big Data Analysis and Computational Fluid Dynamics. 2017. url: http://www.optalysys.com/optalysys-prototype-proves-optical-processing-technology-will-revolutionise-big-data-analysis-computational-fluid-dynamics-cfd.

[20] Optalysys Rolls Commercial Optical Processor. HPC Wire. Mar. 2019. url: https://www.hpcwire.com/2019/03/07/optalysys-rolls-commercial-optical-processor/.

[21] Y. Pan, J. Kim, and G. Memik. "FlexiShare: Channel sharing for an energy-efficient nanophotonic crossbar". In: Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA). 2010, pp. 1–12. doi: 10.1109/HPCA.2010.5416626.

[22] D. Vantrease et al. "Corona: System Implications of Emerging Nanophotonic Technology". In: SIGARCH Comput. Archit. News 36.3 (June 2008), pp. 153–164. doi: 10.1145/1394608.1382135.

[23] Y. Pan, P. Kumar, J. Kim, G. Memik, Y. Zhang, and A. Choudhary. "Firefly: Illuminating Future Network-on-chip with Nanophotonics". In: SIGARCH Comput. Archit. News 37.3 (June 2009), pp. 429–440. doi: 10.1145/1555815.1555808.

[24] M. Petracca, B. G. Lee, K. Bergman, and L. P. Carloni. "Design Exploration of Optical Interconnection Networks for Chip Multiprocessors". In: Proceedings of the Symposium on High Performance Interconnects (HOTI). 2008, pp. 31–40. doi: 10.1109/HOTI.2008.20.

[25] P. Grani and S. Bartolini. "Design Options for Optical Ring Interconnect in Future Client Devices". In: Journal on Emerging Technologies in Computing Systems (JETC) 10.4 (June 2014), 30:1–30:25. doi: 10.1145/2602155.

[26] I. O'Connor, D. Van Thourhout, and A. Scandurra. "Wavelength Division Multiplexed Photonic Layer on CMOS". In: Proceedings of the 2012 Interconnection Network Architecture: On-Chip, Multi-Chip Workshop (INA-OCMC). 2012, pp. 33–36. doi: 10.1145/2107763.2107772.

[27] P. Grani, R. Hendry, S. Bartolini, and K. Bergman. "Boosting multi-socket cache-coherency with low-latency silicon photonic interconnects". In: 2015 International Conference on Computing, Networking and Communications (ICNC). Feb. 2015, pp. 830–836. doi: 10.1109/ICCNC.2015.7069453.
[28] Horizon 2020 EU Framework Programme. Photonics. 2017. url: https://ec.europa.eu/programmes/horizon2020/en/h2020-section/photonics.

[29] Photonics21. The European Technology Platform Photonics21. 2017. url: http://www.photonics21.org.
3.4.2 Quantum Computing

Overall View

In the Quantum Technologies Flagship Final Report, the following overall quantum computing objective is formulated: The goal of quantum computing is to complement and outperform classical computers by solving some computational problems more quickly than the best known or the best achievable classical schemes. Current applications include factoring and machine learning, but more and more applications are being discovered. Research focuses both on quantum hardware and quantum software – on building and investigating universal quantum computers, and on operating them, once scaled up, in a fault-tolerant way. The defined quantum milestones are:

What has been achieved is the definition of the cubic schematic given in Fig. 3.7, which is one of the first, if not the only, structured views on what the different components and system layers of a quantum computer are. This particular view is still the main driver of, e.g., the collaboration that QuTech in Delft has with the US company Intel.

More recently, the semiconductor industry has expressed interest in the development of the qubit processor. The underlying quantum technology has matured such that it is now reaching the phase where large and established industrial players are becoming increasingly interested and ambitious to be among the first to deliver a usable and realistic quantum platform.
Figure 3.7: (a) Experimental full stack with realistic qubits. (b) Simulated full stack with perfect qubits.
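The "perfect qubits" of the simulated stack can be illustrated with a few lines of linear algebra: ideal, noise-free evolution of a state vector. The sketch below is ours, not the project's actual simulator API; a full back-end such as the QX simulator generalizes this to many qubits and adds error models.

```python
# Toy state-vector simulation with perfect (noise-free) qubits:
# a Hadamard followed by a CNOT turns |00> into the Bell state
# (|00> + |11>)/sqrt(2). Basis order of the amplitudes: |00>,|01>,|10>,|11>.
import math

def apply(gate, state):
    """Multiply a gate matrix (list of rows) by a state vector."""
    return [sum(row[j] * state[j] for j in range(len(state))) for row in gate]

s = 1 / math.sqrt(2)
H_ON_Q0 = [  # Hadamard on the first of two qubits (H tensor I)
    [s, 0, s, 0],
    [0, s, 0, s],
    [s, 0, -s, 0],
    [0, s, 0, -s],
]
CNOT = [     # control = first qubit, target = second
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]

state = [1.0, 0.0, 0.0, 0.0]                # |00>
state = apply(CNOT, apply(H_ON_Q0, state))  # H on q0, then CNOT(q0 -> q1)

probs = [round(a * a, 3) for a in state]
print(probs)  # [0.5, 0.0, 0.0, 0.5]: only |00> and |11> are ever measured
```

Replacing such ideal gates with noisy ones, and the state vector with error-corrected logical qubits, is precisely what separates panel (b) from panel (a) of the figure.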
…develop, also defining a roadmap for enabling quantum computation and quantum-accelerated applications in Europe and worldwide for the next decade.

Semiconductor and related companies such as IBM, Intel and Microsoft are increasingly known for their activities in quantum computing, but players such as Google and Rigetti are also becoming very present, and the Canadian company D-WAVE is a well-known and established player in this field. National initiatives in, e.g., China are also becoming increasingly important. This has two main consequences.

First, the companies involved are racing against each other to find ways to make a realistically sized and stable quantum computer. What technology will ultimately be selected for making the quantum processor is still unknown, but the set of candidates has been reduced to basing qubit development on impurities in diamond (Nitrogen Vacancy (NV) centres), the use of semiconducting or superconducting qubits, the adoption of ion traps, quantum annealing as in the D-WAVE machine, or an even more futuristic approach involving Majorana-based qubits, best described as topological quantum computing. Each technology has advantages and disadvantages, and which one will win is still an open issue.

The second major consequence is that companies, countries as well as universities see an increasing need for training quantum engineers: people capable of designing both the quantum device and the complete system design and related tools to build a quantum computer. Just like making a digital computer, this involves different disciplines, ranging from pure hardware-oriented activities such as material science, circuit design and hardware architecture, through software fields such as operating systems and compiler construction, up to high-level programming and algorithm design. So educating a quantum engineer involves many disciplines and goes beyond understanding quantum physics.

Building a new kind of computer is a very multidisciplinary task that spans fields ranging from microelectronics up to computer science. Computer engineering is the connection between computer science and microelectronics; it is also defined as hardware-software co-design, which basically means deciding what should be implemented in hardware and what will stay in software.

In conventional CMOS technology, this means that a processor architecture is defined, consisting of, e.g., the instruction set and the corresponding micro-architecture. In the context of quantum computing, computer engineering then ranges from quantum physics up to computer science. When the request came in Delft to be involved in the creation of a quantum computer, the first literature excursions immediately taught us that there is no scientific research being done on what it means to build such a machine. In the first conversations with the physicists, the term 'quantum architecture' was very frequently used, until the computer engineers asked what their definition of that term was. The answer was amazingly simple even though their research was extremely complex: a quantum architecture is a 2D layout of qubits that can be addressed and controlled individually. Therefore, we lack a definition of a system architecture for quantum
computers. Such a system architecture is an essential component both for establishing a common interface between the different projects on quantum computing and for defining structured contents for training engineers in this discipline.

As shown in Figure 3.7, we focus more on the levels above the quantum physical chip, which is a necessary, but not sufficient, component of building a quantum computer. This layered stack defines the research roadmap and the layers that need to be developed when building a quantum computer, going from a high-level description of a quantum algorithm to the actual physical operations on the quantum processor. Quantum algorithms [1] are described in high-level quantum programming languages [2, 3, 4]. Such an algorithm description is agnostic to the faulty quantum hardware and assumes that both qubits and quantum operations are perfect. In the compilation layer, quantum algorithms are converted to their fault-tolerant (FT) version based on a specific quantum error correction code, such as the surface code [5] or color codes [6], and compiled into a series of instructions that belong to the quantum instruction set architecture (QISA). The micro-architecture layer contains components that focus on quantum execution (QEX) and parts that are required for quantum error correction (QEC), which together are responsible for the execution of quantum operations and for the detection and correction of errors [7]. We will extend the micro-architecture already developed for the 5-qubit superconducting processor [8] to support larger numbers of qubits and error-correction feedback. It is in these layers that quantum instructions are translated into the actual pulses that are sent through the classical-to-quantum interface to the quantum chip.

The consortium's guiding vision is that quantum computing will be delivered through the union of the classical computer with quantum technology, by integrating a quantum computing device as an accelerator of the general-purpose processors. All accelerators are required to define a model of computation and provide the appropriate consistency in order to support a software ecosystem. As with the semiconductor device, there are many material implementations of a qubit, and as such the Q-Machine will define a micro-architecture through which different qubit devices can be integrated into the machine and deliver their value across all applications that utilize the Q-Machine architecture.

From this discussion, one can also realize that building a complete quantum computer involves more than just building quantum devices. Whereas physicists mostly work at the quantum chip layer, trying to improve the coherence of the qubits and the fidelity of the gates, as well as to increase the number of qubits that can be controlled and entangled, computer and electronic engineers are responsible for developing the infrastructure required to build such a quantum system. As we will expand in this proposal, the combination of classical with quantum logic is needed when investigating and building a full quantum computer. Following Figure 3.7, we briefly describe below the different abstraction layers on which this research focuses its contributions, starting at the application layer and going down to the micro-architectural one. The ongoing research does not focus on the quantum-to-classical layer, nor on the quantum physics of making good qubits. However, explicit links with those layers and with quantum physics and control electronics projects will be established and described later.

Important in this context is the definition and implementation of an open-standard quantum computer system design, which has the overall architecture and fault tolerance required to solve problems arising in the computational science domain, and which is of practical interest for different industries such as aerospace and medicine. The overall design will be detailed and implemented both in an experimental way, using physical quantum devices, and in large-scale simulation models.

• OBJ 1 - Full Stack System Design: design and develop a full-stack quantum system that integrates all scientific and technological results from different fields, ranging from algorithms up to the quantum processors. This will allow potential users of that device to easily express quantum algorithms using an appropriate programming language. This full-stack approach defines and implements the bridge between the qubit device and the user application-driven world.

• OBJ 2 - Scalable Architecture: provide an open and available simulation platform for design-space exploration of the quantum computer architecture, as well as an instrument to advance application development. It will allow the control of a large number of qubits (over 1000 qubits), and will include fault-tolerance mechanisms, mapping of quantum circuits and routing of quantum states.
• OBJ 3 - Societally Relevant Quantum Applications: provide relevant quantum search algorithms for DNA analysis, quantum linear solvers, and quantum-accelerated optimization algorithms with immediate use in the medicine and aerospace domains. We also provide an associated benchmarking capability to evaluate quantum computers as they evolve to a level of practical industrial interest.

Any computational platform needs a system design that ties together all layers going from algorithm to physical execution. We will investigate, develop and implement an architecture for a quantum computer, the Q-Machine, that includes and integrates algorithms and applications, programming languages, compilers and run-time, instruction set architecture and micro-architecture. The Q-Machine shall be realised in two implementations: the first is a simulator that will use the QX simulator as a back-end to perform and validate large-scale and fault-tolerant quantum computing. The second implementation consists of most likely two physical machines that will utilize one of the latest cryogenic qubit devices, superconducting qubits or silicon spin qubits, to verify the results of this research (up to TRL7). These machines will each support the new quantum architecture and corresponding processors on which applications can be run. These devices adopt and enhance the software development stack developed by Delft University of Technology, as well as the existing experimental quantum demonstrator.

The development and realisation of a new computing paradigm, such as the Q-Machine, requires participation, expertise and capabilities from a broad consortium of different disciplines. To address the objectives of Quanto and the associated domains of involvement and abstraction, it is necessary to assemble the Quanto consortium with participation and expertise ranging from the end customer, in this case represented by the aerospace and genome industries and HPC, to the computer science and mathematics community, as well as the classical semiconductor industry with computer and systems engineering experts.

Short-term vision - Quantum Computer Engineering is a very new field which is only in the very first phase of its existence. Computer architecture research on developing a gate-based quantum computer is basically non-existent, and QuTech is one of the few places on earth where this line of research is actively pursued. The work planned for this project is heavily based on the ongoing QuTech research on these topics. The Delft team has already demonstrated the use of a micro-architecture for both superconducting and semiconducting qubits. They have also developed OpenQL and are instrumental in standardising the Quantum Assembly Language, which allows quantum logic to be expressed in executable quantum instructions. The notion of micro-code generation has also been introduced as part of the micro-architecture.

Quantum Genome Sequencing – a Quantum Accelerator Application

Since one of the first papers about quantum computing by R. Feynman in 1982 [9], research on quantum computing has focused on the development of low-level quantum hardware components like superconducting qubits, ion-trap qubits or spin qubits. The design of proof-of-concept quantum algorithms and their analysis with respect to their theoretical complexity improvements over classical algorithms has also received some attention. A true quantum killer application that demonstrates the exponential performance increase of quantum over conventional computers in practice is, however, still missing, but is urgently needed to convince quantum sceptics of the usefulness of quantum computing and to make it a mainstream technology within the coming 10 years.

Genomics concerns the application of DNA sequencing methods and bioinformatics algorithms to understand the structure and function of the genome of an organism. This discipline has revealed insights with scientific and clinical significance, such as the causes that drive cancer progression, as well as the intra-genomic processes which greatly influence evolution. Other practical benefits include enhancing food quality and quantity from plants and animals. An exciting prospect is personalised medicine, in which accurate diagnostic testing can identify patients who can benefit from targeted therapies [10].

Such rapid progress in genomics is based on exponential advances in the capability of sequencing technology, as shown in Figure 3.8. However, to keep up with these advances, which outpace Moore's Law, we need to address the new computational challenges of efficiently analysing and storing the vast quantities of genomics data. Despite the continual development of tools to process genomic data, current approaches are still yet to meet the requirements for large-scale clinical
60 Technology
genomics [11]. In this case, patient turnaround time, ries of instructions, expressed in a quantum Assembly
ease-of-use, resilient operation and running costs are language QASM, that belong to the defined instruction
critical. set architecture. Note that the architectural hetero-
geneity where classical processors are combined with
different accelerators such as the quantum accelera-
tor, imposes a specific compiler structure where the
different instruction sets are targeted and ultimately
combined in one binary file which will be executed.
Quantum Programming Languages and Compilers We will also investigate classical reliability mech-
anisms throughout the stack from the application
and the compiler layer all the way into the micro-
The quantum algorithms and applications presented
architecture and circuit layers to achieve resilience
in the previous section, can be described using a
in a co-designed manner. This will require overhaul-
high-level programming language such as Q#, Scaf-
ing classical fault-tolerance schemes such as check-
fold, Quipper and OpenQL [13, 3, 2], and compiled into
pointing and replicated execution and adapt them to
a series of instructions that belong to the (quantum)
the resilience requirements of a quantum computing
instruction set architecture.
system. Error reporting will be propagated up the
As shown in Figure 3.9, the compiler infrastructure system stack to facilitate the application of holistic
for such a heterogeneous system will consist of the fault-tolerance.
classical or host compiler combined with a quantum
compiler. The host compiler compiles for the clas-
sical logic and the quantum compiler will produce After the discussion of the quantum algorithm layer
the quantum circuits (we adopt the circuit model as a and the needed quantum compiler, we focus in the
computational model) and perform reversible circuit next sections on the quantum instruction set archi-
design, quantum gate decomposition and circuit map- tecture and the micro-architecture that we intend to
ping that includes scheduling of operations and place- investigate and build, which should be as independent
ment of qubits. The output of the compiler will be a se- of the underlying quantum technology as possible.
Technology 61
Figure 3.9: Compiler Infrastructure
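As a deliberately simplified illustration of this compile-and-dispatch flow, the sketch below mimics a quantum compiler back-end and the run-time arbiter: virtual qubit operands are translated to physical qubit addresses, gates are lowered to QASM-like text instructions, and each instruction is routed by opcode to either the host CPU or the quantum accelerator. All names here (`lower_to_qasm`, `dispatch`, the opcode sets) are illustrative assumptions, not part of the OpenQL/QuTech stack.

```python
# Toy sketch of the heterogeneous compile-and-dispatch flow described above.
# Virtual qubit indices (compiler view) are mapped to physical qubit indices
# (chip view), gates are lowered to QASM-like text instructions, and an
# arbiter dispatches each instruction by opcode. Purely illustrative.

QUANTUM_OPCODES = {"h", "cnot", "x", "measure"}   # assumed quantum ISA subset
CLASSICAL_OPCODES = {"add", "branch", "mov"}      # assumed host ISA subset

def lower_to_qasm(gates, qubit_map):
    """Lower (opcode, virtual-qubit operands) pairs to QASM-like strings,
    translating virtual qubit addresses to physical ones."""
    instructions = []
    for opcode, operands in gates:
        phys = [qubit_map[q] for q in operands]   # qubit address translation
        instructions.append(f"{opcode} " + ", ".join(f"q{p}" for p in phys))
    return instructions

def dispatch(instruction):
    """Arbiter: route an instruction to the quantum accelerator or host CPU."""
    opcode = instruction.split()[0]
    if opcode in QUANTUM_OPCODES:
        return "quantum-accelerator"
    return "host-cpu"

# A Bell-pair circuit on virtual qubits 0 and 1, placed on physical qubits 2 and 5.
gates = [("h", [0]), ("cnot", [0, 1]), ("measure", [0]), ("measure", [1])]
qasm = lower_to_qasm(gates, qubit_map={0: 2, 1: 5})
print(qasm)                    # ['h q2', 'cnot q2, q5', 'measure q2', 'measure q5']
print(dispatch(qasm[0]))       # quantum-accelerator
print(dispatch("add r1, r2"))  # host-cpu
```

A real quantum compiler additionally performs gate decomposition, scheduling and routing under connectivity constraints; this sketch only captures the interface idea of one combined instruction stream split by opcode.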
QISA and Micro-Architecture

As we already mentioned, a quantum computer will not be a standalone machine but a heterogeneous system, in which classical parts will have to interact with the quantum accelerator or coprocessor. In Figure 3.10, we show what is currently understood as the main system design, where accelerators extend the classical server architecture. What is important to emphasise is that such a heterogeneous multi-core architecture is basically a multi-instruction-set architecture, where e.g. the FPGA, the GPU and now also the future quantum accelerator each have their own instruction set, which will be targeted by their respective compilers. Currently, GPUs and FPGAs combined with many classical processors are a big component of modern server architectures, providing the necessary performance speedup. However, the real performance breakthrough may come from adding a fully operational quantum processor as an accelerator. In the case of a quantum accelerator, any application will contain a lot of classical logic but also calls to quantum libraries, which will be invoked from time to time.

Figure 3.10: Heterogeneous System Design with Different Kinds of Accelerators.

Based on Figure 3.10, we now describe the Quantum Instruction Set Architecture (QISA) and the corresponding micro-architecture at a high level, such that we have a basic design of a quantum computer supporting the execution of quantum instructions and error-correction mechanisms. The instruction set architecture (ISA) is the interface between hardware and software and is essential in a fully programmable classical computer. So is the QISA in a programmable quantum computer. Existing instruction set architecture definitions for quantum computing mostly focus on the description and optimization of quantum applications without considering the low-level constraints of the interface to the quantum processor. It is challenging to design an instruction set that suffices to represent the semantics of quantum applications and to incorporate the quantum execution requirements, e.g., timing constraints.

It is a prevailing idea that quantum compilers generate technology-dependent instructions [23, 3, 24]. However, not all technology-dependent information can be determined at compile time, because some information can only be generated at runtime due to hardware limitations. An example is the presence of defects on a quantum processor affecting the layout of qubits used in the algorithm. In addition, the following observations hold: (1) quantum technology is rapidly evolving, and more optimized ways of implementing the quantum gates are continuously explored and proposed; a way to easily introduce those changes, without impacting the rest of the architecture, is important; (2) depending on the qubit technology, the kind, number and sequence of the pulses can vary. Hence, it is another challenge to micro-architecturally support a set of quantum instructions that is as independent as possible of a particular technology and its current state-of-the-art.

Example of Quantum Computer Micro-Architecture

The overall micro-architecture, as shown in Figure 3.11, is a heterogeneous architecture which includes a classical CPU as a host and a quantum coprocessor as an accelerator. The input of the micro-architecture is a binary file generated by a compiler infrastructure in which classical code and quantum code are combined. As we mentioned previously, the classical code is produced by a conventional compiler such as GCC and executed by the classical host CPU. Quantum code is generated by a quantum compiler and executed by the quantum coprocessor. As shown in Figure 3.11, based on the opcode of the instruction, the arbiter sends the instruction either to the host CPU or to the quantum accelerator. In the remaining sections, we focus on the architectural support for the execution of quantum instructions and not on the execution of instructions on the classical CPU. The goal of this research part is to define a Quantum Hardware Abstraction Layer (QHAL) such that quantum accelerators can be easily integrated with classical processors. The QHAL will be defined and implemented such that it is used in the full-stack simulation that we intend in this project as the final demonstrator of the research.

In the quantum accelerator, executed instructions in general flow through modules from left to right. The topmost block of the figure represents the quantum chip, and the other blocks represent the classical logic needed to control it. The blue parts (the classical-quantum interface) are underlying technology-dependent wave-control modules. Digital-to-Analogue Converters (DAC) are used to generate the analogue waveforms that drive the quantum chip, and Analogue-to-Digital Converters (ADC) to read the measured analogue waveform. They receive or send signals to the Flux and Wave Control Unit and the Measurement Discrimination Unit. In the following paragraphs, we discuss the functional blocks that are needed to execute instructions in the QISA and to support quantum error correction. These blocks are based on the control logic developed for the Transmon-based processor as described in [14].

Figure 3.11: Example of Quantum Computer Micro-Architecture

The Quantum Control Unit (QCU), which is one implementation of the QHAL, decodes the instructions belonging to the QISA and performs the required quantum operations, feedback control and QEC. The QCU can also communicate with the host CPU, where classical computation is carried out, through the eXchange Register File (XRF). The QCU includes the following blocks:

• Quantum Instruction Cache, Qubit Address Translation and Q Symbol Table: Instructions from the Quantum Instruction Cache are first address-translated by the Qubit Address Translation module. This means that the compiler-generated, virtual qubit addresses are translated into physical ones. This is based on the information contained in the Qubit Symbol Table, which provides an overview of the exact physical location of the logical qubits on the chip and records which logical qubits are still alive. This information needs to be updated when quantum states or qubits are moved (routing).

• Execution Controller: The Execution Controller can be seen as the brain of the Quantum Control Unit; it asks the Quantum Instruction Cache and the Qubit Address Translation to perform the necessary instruction fetch and decode. The Execution Controller then makes sure the necessary steps in the instruction execution are taken, such as sending instructions to the Pauli Arbiter for further processing. It is also responsible for keeping the Q Symbol Table up to date.

• Cycle Generator: As far as error correction is concerned, the necessary ESM instructions for the entire qubit plane are added at run-time by the QED Cycle Generator, based on the information stored in the Q Symbol Table. It also substantially shortens the datapath for the execution of these instructions.

• QED Unit: The responsibility of the QED Unit is to detect errors based on error syndrome measurement results. These measurements are decoded to identify what kind of error was produced and on which qubit. The decoder will use decoding algorithms such as the Blossom algorithm.

• Pauli Frame Unit and Pauli Arbiter: The Pauli frame mechanism [25] allows us to classically track Pauli errors without physically correcting them. The Pauli Frame Unit manages the Pauli records for every data qubit. The Pauli Arbiter receives instructions from the Execution Controller and the QED Unit. It skips all operations on ancilla qubits and sends them directly to the PEL, regardless of the operation type.

• Logical Measurement Unit: The function of the Logical Measurement Unit is to combine the data qubit measurement results into a logical measurement result for a logical qubit. The Logical Measurement Unit sends the logical measurement result to the ERF, where it can be used in Binary Control by the Execution Controller, or picked up by the host processor and used e.g. in branch decisions.

Conclusion

This section has provided an overview of what quantum computing involves and where we currently stand. Compared to classical computers, it is clear that quantum computing is in the pre-transistor phase. This is mainly due to the fact that multiple technologies are competing against each other to become the dominant qubit technology, the analogue of the transistor. The second great problem that still needs to be solved is the computational behaviour of the qubits, whose reliability is many orders of magnitude lower than that of any computation performed by a CMOS-based transistor. A final challenge is that the quantum bits are analogue and need to be controlled by a digital micro-architecture. One can think of developing an analogue computer again, but that technology is far from evident in the next 10 years. We need to focus very intensively on quantum computing, but we have to realise that it will take at least 10 to 15 years before the first quantum accelerators become available.

References

[1] J. Stephen. "Quantum Algorithm Zoo". List available at http://math.nist.gov/quantum/zoo (2011).

[2] A. Green et al. "An introduction to quantum programming in Quipper". In: Reversible Computation. Springer, 2013, pp. 110–124.

[3] A. J. Abhari et al. Scaffold: Quantum programming language. Tech. rep. DTIC Document, 2012.

[4] D. Wecker and K. M. Svore. "LIQUi|>: A software design architecture and domain-specific language for quantum computing". In: arXiv preprint arXiv:1402.4467 (2014).

[5] A. G. Fowler, M. Mariantoni, J. M. Martinis, and A. N. Cleland. "Surface codes: Towards practical large-scale quantum computation". In: Physical Review A 86.3 (2012), p. 032324.

[6] H. Bombín. "Dimensional jump in quantum error correction". In: New Journal of Physics 18.4 (2016), p. 043038.

[7] X. Fu et al. "A heterogeneous quantum computer architecture". In: Proceedings of the ACM International Conference on Computing Frontiers. ACM, 2016, pp. 323–330.

[8] D. Risté, S. Poletto, M.-Z. Huang, A. Bruno, V. Vesterinen, O.-P. Saira, and L. DiCarlo. "Detecting bit-flip errors in a logical qubit using stabilizer measurements". In: Nature Communications 6 (2015).

[9] R. P. Feynman. "Simulating physics with computers". In: International Journal of Theoretical Physics 21.6-7 (1982), pp. 467–488. doi: 10.1007/BF02650179.

[10] M. Hamburg and F. Collins. "The path to personalized medicine". In: New England Journal of Medicine 363 (2010), pp. 301–304.
[11] R. Gullapalli et al. "Next generation sequencing in clinical medicine: Challenges and lessons for pathology and biomedical informatics". In: Journal of Pathology Informatics 3.1 (2012), p. 40.

[12] M. Stratton et al. "The cancer genome". In: Nature 458 (2009), pp. 719–724.

[13] "The Q# Programming Language". https://docs.microsoft.com/en-us/quantum/quantum-qr-intro?view=qsharp-preview (2017).

[14] D. Risté, S. Poletto, M.-Z. Huang, et al. "Detecting bit-flip errors in a logical qubit using stabilizer measurements". In: Nature Communications 6 (Apr. 2015). url: http://dx.doi.org/10.1038/ncomms7983.

[15] A. Córcoles, E. Magesan, S. J. Srinivasan, A. W. Cross, M. Steffen, J. M. Gambetta, and J. M. Chow. "Demonstration of a quantum error detection code using a square lattice of four superconducting qubits". In: Nature Communications 6 (2015).

[16] J. Kelly et al. "State preservation by repetitive error detection in a superconducting quantum circuit". In: Nature 519.7541 (2015), pp. 66–69.

[17] D. A. Lidar and T. A. Brun, eds. Quantum Error Correction. Cambridge University Press, 2013.

[18] P. W. Shor. "Scheme for reducing decoherence in quantum computer memory". In: Physical Review A 52.4 (1995), R2493.

[19] A. Steane. "Multiple-particle interference and quantum error correction". In: Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences (1996).

[20] A. R. Calderbank and P. W. Shor. "Good quantum error-correcting codes exist". In: Physical Review A 54.2 (1996), p. 1098.

[21] D. Gottesman. "Class of quantum error-correcting codes saturating the quantum Hamming bound". In: Physical Review A 54.3 (1996), p. 1862.

[22] H. Bombin and M. A. Martin-Delgado. "Topological quantum distillation". In: Physical Review Letters 97.18 (2006), p. 180501.

[23] K. M. Svore, A. V. Aho, A. W. Cross, I. Chuang, and I. L. Markov. "A layered software architecture for quantum computing design tools". In: Computer 39.1 (2006), pp. 74–83.

[24] T. Häner, D. S. Steiger, K. Svore, and M. Troyer. "A software methodology for compiling quantum programs". In: arXiv preprint arXiv:1604.01401 (2016).

[25] E. Knill. "Quantum computing with realistically noisy devices". In: Nature 434.7029 (Mar. 2005), pp. 39–44. url: http://dx.doi.org/10.1038/nature03350.
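The Pauli frame mechanism used in the micro-architecture above lends itself to a compact classical sketch: instead of physically applying corrective X/Z gates, the control logic records a pending Pauli per data qubit and folds it into later operations, e.g., flipping a Z-basis measurement outcome when an X record is pending. A minimal sketch follows; the class and method names are illustrative assumptions, not the QCU's actual interface.

```python
# Minimal classical Pauli-frame tracker: corrections indicated by the error
# decoder are recorded per data qubit rather than executed as gates, and a
# pending X record flips subsequent Z-basis measurement outcomes.

class PauliFrame:
    def __init__(self, num_qubits):
        # Per-qubit record as (x, z) bits; (0, 0) means identity.
        self.frame = [(0, 0) for _ in range(num_qubits)]

    def record(self, qubit, pauli):
        """Fold a decoder-reported correction (I, X, Z or Y) into the frame."""
        x, z = self.frame[qubit]
        dx = 1 if pauli in ("X", "Y") else 0
        dz = 1 if pauli in ("Z", "Y") else 0
        self.frame[qubit] = (x ^ dx, z ^ dz)   # Paulis compose modulo 2

    def adjust_measurement(self, qubit, raw_outcome):
        """A pending X (or Y) record flips a Z-basis measurement outcome."""
        x, _ = self.frame[qubit]
        return raw_outcome ^ x

pf = PauliFrame(num_qubits=3)
pf.record(1, "X")          # decoder says: qubit 1 suffered a bit flip
pf.record(1, "Z")          # later: also a phase flip, i.e. a net Y record
print(pf.frame[1])                               # (1, 1)
print(pf.adjust_measurement(1, raw_outcome=0))   # 1 (outcome flipped)
print(pf.adjust_measurement(0, raw_outcome=0))   # 0 (no pending record)
```

The attraction of this scheme is that Pauli records cost only classical bookkeeping: corrections never consume quantum gate time, which is why the Pauli Frame Unit sits in the classical control logic rather than in the quantum datapath.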
3.4.3 Beyond CMOS Technologies

Nano-structures like Carbon Nanotubes (CNT) or Silicon Nanowires (SiNW) expose a number of special properties which make them attractive for building logic circuits or memory cells.

Carbon Nanotubes

Carbon Nanotubes (CNTs) are tubular structures of carbon atoms. These tubes can be single-walled (SWNT) or multi-walled (MWNT) nanotubes. Their diameter is in the range of a few nanometers. Their electrical characteristics vary, depending on their molecular structure, between metallic and semiconducting [1]. CNTs can also be doped, e.g. by nitrogen or boron, yielding n-type or p-type CNTs.

A CNTFET consists of two metal contacts which are connected via a CNT. These contacts are the drain and source of the transistor. The gate is located next to or around the CNT and separated from it by a layer of silicon oxide [2]. Also, crossed lines of appropriately selected CNTs can form a tunnel diode. This requires the right spacing between the crossed lines, which can be changed by applying an appropriate voltage to the crossing CNTs.

CNTs can be used to build nano-crossbars, which logically are similar to a PLA (programmable logic array). The crosspoints act as diodes if the CNTs have the proper structure in both directions. They offer wired-AND conjunctions of the input signals. Together with inversion/buffering facilities, they can create freely programmable logic structures. The density of active elements is much higher than with individually formed CNTFETs.

Current State of CNT   In September 2013, Max Shulaker from Stanford University published a computer with digital circuits based on CNTFETs. It contained a 1-bit processor, consisting of 178 transistors, and ran at a frequency of 1 kHz [3]. The current state of CNTFETs was demonstrated by Shulaker (now at MIT) by implementing a RISC-V-based processor with 32-bit instructions and a 16-bit datapath, running at 10 kHz and consisting of 14,000 transistors [4].

While the RISC-V implementation is an impressive achievement, it should be noted that the size of the CNTFETs currently is in the range of 1 µm², which is orders of magnitude larger than in conventional silicon technology. Also, the switching speed is by far not comparable to regular CMOS technology. The main purpose of this demonstration chip is to prove that the difficult manufacturing process and the inherent uncertainties of using CNTs can be managed even at large scale.

Nanotube-based RAM is a proprietary memory technology for non-volatile random access memory developed by Nantero (the company also refers to this memory as NRAM) and relies on crossing CNTs as described above. An NRAM "cell" consists of a non-woven fabric matrix of CNTs located between two electrodes. The resistance state of the fabric is high (representing the off or 0 state) when (most of) the CNTs are not in contact, and low (representing the on or 1 state) otherwise. Switching the NRAM is done by adjusting the space between the layers of CNTs. In theory, NRAM can reach the density of DRAM while providing performance similar to SRAM [5]. NRAMs are supposed to be produced in volume in 2020.

Currently there is little to no activity to develop CNT-based programmable logic devices in crossbar architecture.

Impact on Hardware   Apart from active functionality, CNTs are also excellent thermal conductors. As such, they could significantly improve conducting heat away from CPU chips [6].

The impact of CNTFETs is not yet clear. Their major benefit is their superior energy efficiency compared to CMOS transistors.

The use of CNTs in RAMs is at a commercial threshold and will have a similar impact as other memristor technologies.

CNT-based NanoPLAs promise to deliver very high logic density for programmable logic. This would enable the inclusion of very large programmable devices into processors and could thus be the basis for large-scale application-specific accelerators.

Silicon Nanowires

Silicon Nanowires are the natural extension of FinFET technology. Silicon Nanowires are silicon filaments with circular cross sections surrounded by a gate-all-around. This structure gives excellent electrostatic control of the charge under the gate area, and thus
enhances conduction in the ON region and reduces leakage in the OFF region. Silicon Nanowires can be horizontally stacked, thus providing designers with a small stack of transistors in parallel [7]. Intel has shown working Silicon Nanowire based transistors and their extension to Silicon Nanosheets, which are similar to stacks of horizontal nanowires but with an oval-elongated cross section.

Silicon Nanowires can be used to design controlled-polarity transistors when interfaced to nickel source/drain contacts [8]. Such transistors have two independent gates. A polarity gate dopes electrostatically (i.e., through a radial field) the underlying Schottky junctions at the contact interfaces. As a result, the polarization of this gate can block the flow of either electrons or holes in this natively ambipolar device, thus generating a p or an n transistor, respectively. The control gate is then used to turn the transistor on or off. These types of transistors can also be fabricated as horizontal stacks. Their advantage is their lack of implanted regions, and thus the avoidance of the related dopant-variability problems.

Vertical nanowires have also been studied, and they can potentially yield very dense computational structures [9]. Nevertheless, they have a much higher fabrication complexity, and their readiness for use in computational circuits is still far off.

SiNWs can be formed in a bottom-up self-assembly process. This might lead to substantially smaller structures than those that can be formed by lithographic processes. Additionally, SiNWs can be doped, and thus crossed lines of appropriately doped SiNWs can form diodes.

Current State of SiNW   Nano-crossbars have been created from SiNWs [10]. Similar to CNT-based crossbars, the fundamental problem is the high defect density of the resulting circuits. Under normal semiconductor classifications, these devices would be considered broken. In fact, usage of these devices is only possible if the individual defects of the devices can be respected during the logic mapping stage of the hardware synthesis [11].

While there is no demonstration chip for SiNWs as impressive as that for CNTFETs, the area supremacy of SiNW circuits was already shown in 2005 by DeHon [12]. He demonstrated that such devices can reach a logic density which is two orders of magnitude higher than traditional FPGAs built with CMOS technology.

Currently, less research on nanowires is active than in the early 2000s. Nevertheless, some groups are pushing the usage of nanowires for the creation of logic circuits. At the same time, more research is going on to deal with the high defect density.

Impact on Hardware   SiNW devices are currently only used in NanoPLAs. The major reason for this restriction is their manufacturing process: they can only be manufactured efficiently if identical structures are created in large quantity. This perfectly fits NanoPLAs, but not irregular device structures such as processors. Thus, SiNWs will most likely only be relevant for the realization of programmable logic, as used e.g. in application-specific accelerators.

Two-dimensional (2D) Electronics

Groundbreaking research on graphene has paved the way to explore various two-dimensional (2D) electronic materials. The applications of graphene in nanoelectronics are limited, because graphene does not have a bandgap and thus it is harder (though not impossible) to fabricate efficient transistors. This decade has shown a surge in research on materials with a structure like graphene, where a monolayer (or a few adjacent layers) can provide the means for fabricating transistors. Despite the diversity in conduction properties and atomic composition, all 2D materials consist of covalently-bonded in-plane layers that are held together by weak van der Waals interactions to form a 3D crystal. In general, 2D materials are transition metal dichalcogenides (TMDC), composed of a transition metal sandwiched between two chalcogen atoms.

Radisavljevic and co-workers (under the supervision of A. Kis [13]) designed and fabricated the first transistor in Molybdenum Disulfide (MoS2), which constituted the entry point of 2D materials into nanoelectronics. Wachter [14] designed in 2017 the first, though simple, processor in this material, with about 120 transistors. A major limitation of MoS2 is the inability to realize complementary devices and circuits, thus requiring the use of depletion loads, as in NMOS, that contribute to static power consumption. Other 2D materials, like Tungsten Diselenide (WSe2), have
been used to realize both n and p transistors, thus enabling complementary logic in 2D. Resta [15] designed and fabricated controlled-polarity WSe2 transistors, and with these elements he fabricated and validated a simple cell library including AND, OR, XOR and MAJority gates. This field is evolving rapidly, with a few materials being studied for their properties and aptitude for supporting high-performance, low-power computation. A recent survey is [16].

Superconducting Electronics

Superconducting electronics (SCE) is a branch of engineering that leverages computation at a few degrees Kelvin (typically 4 K), where resistive effects can be neglected and where switching is achieved by Josephson junctions (JJ). Current difficulties in downscaling CMOS have made superconducting electronics quite attractive for the following reasons. First, the technology can match and extend current performance requirements at lower energy cost. ALU prototypes have been shown to run at 20-50 GHz clock rates and with increasingly higher power efficiency. Whereas experimental data vary according to circuit family, size and year of production, it is possible to measure a power consumption that is two orders of magnitude lower than that of standard CMOS [17], even when accounting for a cryocooling efficacy of 0.1%. Performance of single-precision operations is around 1 TFLOPS/Watt [18]. Second, today's superconductor circuits are designed in a 250 nm technology, much easier to realize in integrated fashion (as compared to 5 nm CMOS) and with a horizon of a 10-50x possible downscaling, thus projecting one or two decades of further improvement. Cryocooling efficacy is expected to improve as well. Moreover, superconducting interconnect wires allow the ballistic transfer of picosecond waveforms. Therefore, SCE is a strong candidate for high-performance large-system design in the coming decade.

IBM led a strong effort in SCE in the 70s with the objective of building computers that would outperform the then-available technology. The circuits utilized Josephson junctions exhibiting hysteresis between their two states (i.e., resistive and superconductive). The JJ acts as a switch that can be set and reset by applying a current. A logic TRUE is associated with the JJ in its resistive state, and a logic FALSE with its superconductive state. This effort faded in the mid 80s because of various drawbacks, including the choice of materials and the latching operation of the logic [19].

Likharev [19] brought back strong interest in SCE by proposing rapid single flux quantum (RSFQ) circuits. In these circuits, the logic values (TRUE, FALSE) are represented by the presence or absence of single-flux-quantum pulses called fluxons. Junctions are DC-biased, and when a pulse is applied to a junction, the small associated current pulse can be sufficient to drive the current level over its threshold and to generate a pulse that can be propagated through the circuit. This type of behavior is often called a Josephson transmission line (JTL), and it is the basic operational principle of RSFQ circuits, which conditionally propagate flux pulses. A specific feature of RSFQ circuits is that logic gates are clocked and that the overall circuit is pipelined. The RSFQ technology evolved in many directions, e.g., energy-efficient SFQ (eSFQ) [20], reciprocal quantum logic (RQL) [17] and low-voltage RSFQ (LV-RSFQ) [21]. Various realizations of ALUs have been reported, with deep-pipelined, wave-pipelined and asynchronous operation [22].

This technology (and its variations) has several peculiarities. In particular, it is worth mentioning the following. Pulse splitters are used to handle multiple fanouts. Registers are simpler to implement (as compared to CMOS). Conversely, logic gates are more complex (in terms of elementary components). Logic gates can be realized by combining JJs and inductors with different topologies. A fundamental logic gate in RSFQ is the majority gate, which can be simplified to realize the AND and OR functions. Whereas in the past the interest in this technology was related to the realization of arithmetic units (e.g., adders and multipliers) that widely exploit the majority function, today majority logic is widely applicable to general digital design.

Recent research work has addressed technologies that target low energy consumption. This can be achieved by using AC power (i.e., an alternating-current supply). In RQL, power is carried by transmission lines. Two signals in quadrature are magnetically coupled to generate a 4-phase trigger. A TRUE logic signal is represented by sending a positive pulse followed by a negative pulse, while a FALSE logic signal is just the absence of the pulsed signals. An alternative technology is the adiabatic quantum flux parametron (AQFP), where the circuits are also biased by AC power. (A parametron is a resonant circuit with a nonlinear reactive element.) As an example, Takeuchi [23] used a 3-phase bias/excitation as both multi-clock signal and power supply. In general, signal propagation in AQFP circuits requires overlapping clock signals from
neighboring phases [24]. In AQFP, inductor loop pairs [13] B. Radisavljevic, A. Radenovic, J. Brivio, and A. Kis. “Single-
are used to store logic information in terms of flux layer MoS2 transistors”. In: Nature Nanotech 6 (2011).
quanta depending on the direction of an input current [14] S. Wachter, D. Polyushkin, O. Bethge, and T. Mueller. “A mi-
and to the magnetic coupling to other inductors. A croprocessor based on a two-dimensional semiconductor”.
In: Nature Communications 8 (2017), p. 14948.
corresponding output current represents the output
of a logic gate. It was shown [25] that the ¤parallel [15] G.V. Resta, Y. Balaji, D. Lin, I.P. Radu, F. Catthoor, P.-E. Gail-
lardon, G. De Micheli. “Doping-Free Complementary Logic
combination¤ of three AQFP buffers yields a majority Gates Enabled by Two-Dimensional Polarity-Controllable
gate. Recent publications [24] have also advocated the Transistors”. In: ACS Nano 12.7 (2018), pp. 7039–7047.
design and use of majority logic primitives in AQFP [16] G. Resta, A. Leondhart, Y. Balaji, S. De Gendt, P.-E. Gail-
design. Simple cell libraries have been designed for lardon G. De Micheli. “Devices and Circuits using Novel
AQFP as well as some simple synthesis flow from an 2-Dimensional Materials: a Perspective for Future VLSI
HDL description to a cell-based physical design. Systems”. In: IEEE Transaction on Very Large Scale Integration
Systems 27.7 (July 2019).
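The majority behaviour described above can be sketched in a few lines. The following is a purely logical model (not tied to any AQFP cell library or circuit-level behaviour); it shows why majority primitives are attractive as building blocks: tying one input of a 3-input majority gate to a constant yields AND or OR.

```python
def maj3(a: int, b: int, c: int) -> int:
    """3-input majority: output is 1 iff at least two inputs are 1.

    Models the logical behaviour of the 'parallel combination' of
    three AQFP buffers described in the text, not the analog circuit.
    """
    return 1 if a + b + c >= 2 else 0

# Tying one input to a constant specializes the majority gate:
def and2(a, b):  # MAJ(a, b, 0) == a AND b
    return maj3(a, b, 0)

def or2(a, b):   # MAJ(a, b, 1) == a OR b
    return maj3(a, b, 1)

# Check the specializations over all input combinations; together with
# inversion, this makes majority logic a universal gate set.
assert all(and2(a, b) == (a & b) for a in (0, 1) for b in (0, 1))
assert all(or2(a, b) == (a | b) for a in (0, 1) for b in (0, 1))
```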
References

[1] Wikipedia. Carbon nanotube. url: https://en.wikipedia.org/wiki/Carbon_nanotube.
[2] L. Rispal. “Large scale fabrication of field-effect devices based on in situ grown carbon nanotubes”. PhD thesis. Darmstädter Dissertationen, 2009.
[3] M. M. Shulaker, G. Hills, N. Patil, H. Wei, H.-Y. Chen, H.-S. P. Wong, and S. Mitra. “Carbon nanotube computer”. In: Nature 501.7468 (2013), pp. 526–530.
[4] G. Hills, C. Lau, A. Wright, et al. “Modern microprocessor built from complementary carbon nanotube transistors”. In: Nature 572 (Aug. 2019), pp. 595–602. doi: 10.1038/s41586-019-1493-8.
[5] Wikipedia. Nano-RAM. url: https://en.wikipedia.org/wiki/Nano-RAM.
[6] J. Hruska. This carbon nanotube heatsink is six times more thermally conductive, could trigger a revolution in CPU clock speeds. url: http://www.extremetech.com/extreme/175457-this-carbon-nanotube-heatsink-is-six-times-more-thermally-conductive-could-trigger-a-revolution-in-cpu-clock-speeds.
[7] M. Zervas, D. Sacchetto, G. De Micheli, and Y. Leblebici. “Top-down fabrication of very-high density vertically stacked silicon nanowire arrays with low temperature budget”. In: Microelectronic Engineering 88.10 (2011), pp. 3127–3132.
[8] P.-E. Gaillardon, L. G. Amaru’, S. K. Bobba, M. De Marchi, D. Sacchetto, and G. De Micheli. “Nanowire systems: technology and design”. In: Philosophical Transactions of the Royal Society of London A 372.2012 (2014).
[9] Y. Guerfi and G. Larrieu. “Vertical Silicon Nanowire Field Effect Transistors with Nanoscale Gate-All-Around”. In: Nanoscale Research Letters (2016).
[10] S. Devisree, A. Kumar, and R. Raj. “Nanoscale FPGA with reconfigurability”. In: Electrotechnical Conference (MELECON), 2016 18th Mediterranean. IEEE, 2016, pp. 1–5.
[11] M. Zamani and M. B. Tahoori. “Self-timed nano-PLA”. In: 2011 IEEE/ACM International Symposium on Nanoscale Architectures. June 2011, pp. 78–85. doi: 10.1109/NANOARCH.2011.5941487.
[12] A. DeHon. “Nanowire-Based Programmable Architectures”. In: J. Emerg. Technol. Comput. Syst. 1.2 (July 2005), pp. 109–162. doi: 10.1145/1084748.1084750.
[13] B. Radisavljevic, A. Radenovic, J. Brivio, and A. Kis. “Single-layer MoS2 transistors”. In: Nature Nanotech 6 (2011).
[14] S. Wachter, D. Polyushkin, O. Bethge, and T. Mueller. “A microprocessor based on a two-dimensional semiconductor”. In: Nature Communications 8 (2017), p. 14948.
[15] G. V. Resta, Y. Balaji, D. Lin, I. P. Radu, F. Catthoor, P.-E. Gaillardon, and G. De Micheli. “Doping-Free Complementary Logic Gates Enabled by Two-Dimensional Polarity-Controllable Transistors”. In: ACS Nano 12.7 (2018), pp. 7039–7047.
[16] G. Resta, A. Leonhardt, Y. Balaji, S. De Gendt, P.-E. Gaillardon, and G. De Micheli. “Devices and Circuits using Novel 2-Dimensional Materials: a Perspective for Future VLSI Systems”. In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems 27.7 (July 2019).
[17] Q. Herr, A. Herr, O. Oberg, and A. Ioannidis. “Ultra-Low Power Superconducting Logic”. In: Journal of Applied Physics 109 (2011).
[18] M. Dorojevets, Z. Chen, C. Ayala, and A. Kasperek. “Towards 32-bit Energy-Efficient Superconductor RQL Processors: The Cell-Level Design and Analysis of Key Processing and On-Chip Storage Units”. In: IEEE Transactions on Applied Superconductivity 25.3 (June 2015).
[19] K. Likharev and V. Semenov. “RSFQ Logic/Memory Family: A New Josephson-Junction Technology for Sub-terahertz Clock-Frequency Digital Circuits”. In: IEEE Transactions on Applied Superconductivity 1.3 (Mar. 1991), pp. 3–28.
[20] O. Mukhanov. “Energy-Efficient Single Flux Quantum Technology”. In: IEEE Transactions on Applied Superconductivity 21.3 (June 2011), pp. 760–769.
[21] M. Tanaka, A. Kitayama, T. Koketsu, M. Ito, and A. Fujimaki. “Low-Energy Consumption RSFQ Circuits Driven by Low Voltages”. In: IEEE Transactions on Applied Superconductivity 23.3 (2013).
[22] T. Filippov, A. Sahu, A. Kirichenko, I. Vernik, M. Dorojevets, C. Ayala, and O. Mukhanov. “20 GHz operation of an Asynchronous Wave-Pipelined RSFQ Arithmetic-Logic Unit”. In: Physics Procedia 36 (2012).
[23] N. Takeuchi, D. Ozawa, Y. Yamanashi, and N. Yoshikawa. “An Adiabatic Quantum Flux Parametron as an Ultra-Low Power Logic Device”. In: Superconductor Science and Technology 26.3 (2013).
[24] R. Cai, O. Chen, A. Ren, N. Liu, C. Ding, N. Yoshikawa, and Y. Wang. “A Majority Logic Synthesis Framework for Adiabatic Quantum-Flux-Parametron Superconducting Circuits”. In: Proceedings of GLSVLSI. 2019.
[25] C. Ayala, N. Takeuchi, Y. Yamanashi, and N. Yoshikawa. “Majority-Logic-Optimized Parallel Prefix Carry Look-Ahead Adder Families Using Adiabatic Quantum-Flux-Parametron Logic”. In: IEEE Transactions on Applied Superconductivity 27.4 (June 2017).
4 HPC Hardware Architectures
Potential long-term impacts of disruptive technologies could concern the processor logic, the processor-memory interface, the memory hierarchy, and future hardware accelerators. We start with potential future memory hierarchies (see Sect. 4.1), including memristor technologies and the inherent security and privacy issues. Next we look at the processor-memory interface, in particular near- and in-memory computing (see Sect. 4.2). We conclude with future hardware accelerators (see Sect. 4.3) and speculate on future processor logic and new ways of computing (see Sect. 4.4).

Memory is accessed by linear addresses in chunks of word or cache-line size, and mass storage as files. But this model of a memory hierarchy is not effective in terms of performance for a given power envelope. The main source of inefficiency has meanwhile become data movement: the energy cost of fetching a word of data from off-chip DRAM is up to 6400 times higher than that of operating on it [1]. The current consequence is to move the RAM memory closer to the processor by providing High-Bandwidth Memories.
4.1.2 High-Bandwidth Memory (HBM)
Figure 4.1: High Bandwidth Memory utilizing an active silicon Interposer [2]
Power consumption is reduced because of the short wire lengths of TSVs and interposers. Simultaneously, a high communication bandwidth between layers can be expected, leading to particularly high processor-to-memory bandwidth.

4.1.3 Storage-Class Memory (SCM)

Storage-Class Memory (SCM) currently fills the latency gap between fast and volatile RAM-based memory and slow, but non-volatile, disk storage in supercomputers. The gap is currently filled by Flash storage, but could in future be extended by memristive NVM with access times much closer to RAM access times than those of Flash technology.

In that case, memristive-NVM-based SCM could blur the distinction between memory and storage and require new data access modes and protocols that serve both “memory” and “storage”. These new SCM types of non-volatile memory could even be integrated on-chip with the microprocessor cores, as they use CMOS-compatible sets of materials and require different device fabrication techniques from Flash. In a VLSI post-processing step they can be integrated on top of the last metal layer, which is often denoted as a back-end-of-line (BEOL) step (see Sect. 3.2.3).

Deep Memory Hierarchy

Low-speed non-volatile memories might lead to additional levels in the memory hierarchy to efficiently close the gap between mass storage and memory, as demonstrated by Fig. 4.2 for a potential memory hierarchy of a future supercomputer. Memristors as new types of NV memories can be used in different layers of the memory hierarchy, not only in supercomputers but in all kinds of computing devices. Depending on which memory technologies mature, this can have different impacts. Fast non-volatile memories (e.g. STT-RAM) offer the opportunity of merging cache and memory levels. Mid-speed NV memories (e.g. PCM) could be used to merge memory and storage levels.

Shallow Memory Hierarchy

On the other hand, the memory hierarchy might become flatter by merging main memory with storage, in particular for smaller systems. Such a shallow memory hierarchy might be useful for future embedded HPC systems. The cache, main memory and mass storage levels might be merged into a single level, as shown in Figure 4.3. As a result, the whole system would provide improved performance, especially in terms of real-time operation. An increased resistance against radiation effects (e.g. bit flips) would be another positive effect. Also, a shallow memory hierarchy would enable applications to use more non-uniform or highly random data access.

Merging main memory and mass storage allows applications to start much faster. It might be helpful for crash recovery, and it can reduce energy consumption, as it takes less time to activate/deactivate applications.
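The performance trade-off between a deep and a shallow hierarchy can be sketched with a simple average-memory-access-time (AMAT) model. All hit rates and latencies below are assumed illustrative numbers, not measurements of any of the technologies discussed.

```python
def amat(levels):
    """Average access time of a memory hierarchy.

    levels: list of (hit_rate, latency_ns) ordered from the fastest level;
    each hit rate is conditional on missing all previous levels, and the
    last level must catch everything (hit_rate 1.0).
    """
    t, p_reach = 0.0, 1.0
    for hit, lat in levels:
        t += p_reach * hit * lat      # fraction of accesses served here
        p_reach *= (1.0 - hit)        # fraction that falls through
    return t

# Deep hierarchy: cache -> DRAM -> SCM -> disk (assumed numbers).
deep = amat([(0.95, 1), (0.9, 100), (0.9, 1000), (1.0, 10_000_000)])

# Shallow hierarchy: cache -> one merged NVM memory/storage level.
shallow = amat([(0.95, 1), (1.0, 300)])

print(f"deep: {deep:.2f} ns, shallow: {shallow:.2f} ns")
```

With these assumed numbers the rare misses to disk dominate the deep hierarchy, while the shallow hierarchy's worst case is a single mid-speed NVM access, which also illustrates why it tolerates highly random access patterns better.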
Figure 4.3: Usage of NVM in a future (embedded) HPC system with shallow memory hierarchy.
Figure 4.4: Computer Architecture classification

4.2.1 Classification of Computer Architectures

As shown in Fig. 4.4, a computer architecture (or computing system) consists of (one or more) computational units and (one or more) memory cores. A memory core typically consists of one or more cell arrays (used for storage) and memory peripheral circuits (optimised and used to access the memory cells). These memory cores can also be integrated with some dedicated logic circuits in the form of System-in-Packages (SiP). Although many criteria can be used to classify computer architectures, computation location seems to be the best one to define the different classes in a unique manner. Computation location indicates where the result of the computation is produced. Hence, depending on where the result of the computation is produced, we can identify four possibilities, as Fig. 4.4 shows; they are indicated with four circled numbers and can be grouped into two classes [1].

• Computation-Outside-Memory (COM): In this class, the computing takes place outside the memory; hence the need for data movement. There are two flavours of this class: a) Computation-Outside-Memory Far (COM-F), and b) Computation-Outside-Memory Near (COM-N). COM-F refers to the traditional architectures where the computing takes place in computational cores such as a CPU (circle 4 in Fig. 4.4); the memory is seen to be far from the processing unit. In order to reduce the length of the interconnection, COM-N performs the computation in extra logic circuits placed close to the memory core (circle 3 in Fig. 4.4).

• Computation-Inside-Memory (CIM) or In-Memory Computing (IMC): In this class, the computing result is produced within the memory core (i.e., the computing takes place within one of the memories). It also consists of two flavours: CIM-Array (CIM-A) and CIM-Periphery (CIM-P). In CIM-A, the computing result is produced within the array (circle 1 in Fig. 4.4), while in CIM-P the result is produced in the memory peripheral circuits (circle 2 in the figure). Examples of CIM-A architectures use memristive logic designs such as MAGIC and IMPLY [3, 4], and examples of CIM-P architectures contain logical bit-wise operations and vector-matrix multiplications [5, 6].

Table 4.1 shows a qualitative comparison of the four architecture sub-classes [1]. Both CIM-A and CIM-P architectures have a relatively low amount of data movement outside the memory core, as the processing occurs inside the memory core. Therefore, they have the potential to alleviate the memory bottleneck. Instead of moving data from the memory to the computational cores, in these architectures the instructions are moved and directly applied to the memory; these instructions typically operate on a large data set, hence a high level of parallelism can be obtained. Data alignment is required for all architectures. However, the CIM-A and CIM-P classes perform computations directly on the data residing inside the memory, and hence their robustness and performance are impacted more by data misalignment. Note that data alignment cannot be handled by the host processor in the case of CIM architectures, due to the long communication distance, while adding additional logic inside the memory core to handle it is also not trivial. Available bandwidth is another important metric. CIM-A architectures may exploit the maximum bandwidth, as operations happen inside the memory array. CIM-P architectures have a bandwidth range from high to maximum, depending on the complexity of the memory peripheral circuitry. For COM-N, the bandwidth is bounded by on-chip interconnections between the memory core and extra logic circuits; for example, in the Hybrid Memory Cube [2] the bandwidth is limited by the number of TSVs and available registers. This TSV bandwidth is considered high in comparison with COM-F, where the bandwidth is even lower due to off-chip interconnections [7]. Memory design efforts are required to make the computing feasible, especially for CIM. CIM-A architectures require a redesign of the cell, which needs a huge effort. CIM-P architectures require complex read-out circuits, as the output value of two or more accessed cells may end up in multiple levels, resulting in a large complexity which may limit the scalability. COM-N and COM-F architectures utilize the memory in a conventional way, and hence standard memory controllers can be used. Note that CIM-A has a low scalability for several reasons, such as the lack or complexity of the interconnect network it needs within the memory array. COM-N has a medium scalability: even though the logic layer of a memory SiP has more processing resources than peripheral circuits, it cannot accommodate many complex logic units. COM-F has a high scalability due to a mature interconnect network and large space for logic devices.

4.2.2 Near-Memory Computing (NMC) of COM-N

Near-memory computing (Near-Memory Processing, NMP) is characterized by processing in proximity to memory to minimize data transfer costs [8]. Compute logic, e.g. small cores, is physically placed close to the memory chips in order to carry out processing steps, such as stencil operations or vector operations on bulk data. Near-memory computing can be seen as a co-processor or hardware accelerator. It can be realized by replacing or enhancing the memory controller to be able to perform logic operations on the row buffer. In HBM, the Logic Die (see Fig. 4.1) could be enhanced by processing capabilities, and the memory controller could be enabled to perform semantically richer operations than loads, stores, and cache-line replacements.

Near-memory computation can provide two main opportunities: (1) a reduction in data movement due to the vicinity to the main storage, resulting in reduced memory access latency and energy; (2) a higher bandwidth, provided by Through-Silicon Vias (TSVs), in comparison with the interface to the host, which is limited by the pins [9].

Processing by near-memory computing reduces energy costs and goes along with a reduction of the amount of data to be transferred to the processor. Near-memory computing is to be considered a near- and mid-term realizable concept.

Proposals for near-memory computing architectures currently do not yet rely on memristor technologies, but on innovative memory devices that are meanwhile commercially available, such as the Hybrid Memory Cube from Micron [10, 11]. It stacks multiple DRAM dies and a separate layer for a controller which is vertically linked with the DRAM dies. The Smart Memory Cube proposed by [9] is a near-memory computing architecture enhancing
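The vector-matrix multiplication mentioned for CIM-P architectures [5, 6] is typically realized in a memristive crossbar via Ohm's and Kirchhoff's laws: applying voltages to the rows produces column currents equal to the dot products of the voltage vector with the conductance columns. The following is a minimal numerical sketch with assumed device values; it idealizes the devices and ignores non-idealities such as sneak paths and wire resistance.

```python
# Conductance matrix G (in siemens): each memristive cell stores one
# weight as a programmed conductance. Values are assumed, for illustration.
G = [[1e-6, 2e-6],
     [3e-6, 4e-6],
     [5e-6, 6e-6]]           # 3 input rows x 2 output columns

v = [0.1, 0.2, 0.3]          # row voltages encoding the input vector

# Kirchhoff's current law: each column current is sum_i G[i][j] * v[i],
# so the crossbar evaluates a full vector-matrix product in one analog
# step; here we simply compute the same numbers digitally.
i_out = [sum(G[i][j] * v[i] for i in range(len(v))) for j in range(2)]
print(i_out)  # column currents in amperes
```

The point of the sketch is that the whole multiply-accumulate happens where the data is stored, which is exactly the property that gives CIM its low data movement.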
Such technologies exploit, similar to HBM, tight 3D integration.

[Figure: potential off-chip accelerators, e.g. GPU, FPGA, NPU, quantum computer]
4.4.2 Power is Most Important when Committing to New Technology

For the last decade, power and thermal management has been of high importance. The entire market focus has moved from achieving better performance through single-thread optimizations, e.g., speculative execution, towards simpler architectures that achieve better performance per watt, provided that vast parallelism exists. The problem with this approach is that it is not always easy to develop parallel programs; moreover, those parallel programs are not always performance portable, meaning that each time the architecture changes, the code may have to be rewritten.

Research on new materials, such as nanotubes and graphene as (partial) replacements for silicon, can turn the tables and help to produce chips that could run at much higher frequencies and, with that, may even use massive speculative techniques to significantly increase the performance of single-threaded programs. A change in power density vs. cost per area will have an effect on the likelihood of dark silicon.

The reasons why such technologies are not state of the art yet are their premature state of research, which is still far from fabrication, and the unknown production costs of such high-performing chips. But we may assume that in 10 to 20 years these technologies may be mature enough, or other such technologies will have been discovered.

Going back to improved single-thread performance may be very useful for many segments of the market. Reinvestment in this field is essential, since it may change the way we develop and optimize algorithms and code.

Dark silicon (i.e. large parts of the chip having to stay idle due to thermal reasons) may not happen when specific new technologies ripen. New software and hardware interfaces will be the key for successfully applying future disruptive technologies.

4.4.3 Locality of References

Locality of references is a central assumption of the way we design systems. The consequence of this assumption is the need for hierarchically arranged memories, 3D stacking and more.

But new technologies, including optical networks on die and Terahertz-based connections, may reduce the need for preserving locality, since the differences in access time and energy cost between local memory and remote storage or memory may not be as significant in the future as they are today.

When such new technologies find their practical use, we can expect a massive change in the way we build hardware and software systems and organize software structures.

The restriction here is purely the technology, but with all the companies and universities that work on this problem, we may consider it as lifted in the future.
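The locality argument above can be quantified with a simple expected-access-time model; all latencies and hit fractions below are assumed illustrative values. When remote access becomes nearly as fast as local access, the penalty for poor locality (a low local-hit fraction h) almost disappears.

```python
def expected_access_ns(h, t_local, t_remote):
    """Expected access time given a fraction h of accesses served from
    local memory and (1 - h) served from remote storage/memory."""
    return h * t_local + (1.0 - h) * t_remote

# Today (assumed): remote access ~100x slower, so locality dominates.
today_good = expected_access_ns(0.99, 100, 10_000)
today_bad = expected_access_ns(0.50, 100, 10_000)

# Hypothetical optical/THz interconnect: remote only ~2x slower.
future_good = expected_access_ns(0.99, 100, 200)
future_bad = expected_access_ns(0.50, 100, 200)

print(f"today:  good vs bad locality  {today_good:.0f} vs {today_bad:.0f} ns")
print(f"future: good vs bad locality  {future_good:.0f} vs {future_bad:.0f} ns")
```

Under these assumptions, poor locality costs about 25x today but only about 1.5x with the hypothetical interconnect, which is why such technologies could change how we organize software structures.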
5.1 Accelerator Ecosystem Interfaces

The slowdown in silicon scaling and the emergence of heterogeneous logic and memory technologies have led to innovation in interface technologies that connect the various components of an HPC platform node together and provide unified protected abstractions for access to memory as well as to the network and storage. Example consortia that have emerged in recent years and provide a variety of connectivity among heterogeneous components and peripherals are CCIX [1], Gen-Z [2], CAPI [3] and NVLink [4]. These interface technologies vary in their compatibility with each other, in their degree of compatibility with legacy interfaces (e.g., PCIe), and in whether they support hardware cache coherence and conventional memory abstractions.

A key challenge in supporting accelerator-rich environments in future platforms will be supporting higher-level software abstractions in hardware that not only enable protected, seamless sharing of memory among near-neighbor components but also allow accelerators to offer services over the network, coordinated by a host CPU but with the host CPU and OS outside the critical path of computation and communication across nodes. An example is Microsoft Catapult [5], which places FPGAs directly on the network to enable communication with other FPGAs across the network without the host in the way.

Another key challenge in future accelerator-rich environments is moving away from virtual memory and paging abstractions for protected access. Conventional OS abstractions for address translation and their architectural support in modern platforms date back to the desktop PCs of the 80s, and are already at their limits for Terabyte-scale memory nodes, requiring tens of thousands of TLB entries in hierarchies per core. Many important classes of emerging accelerators are limited in efficiency and performance by data movement and require protected access to memory that can reach orders of magnitude more capacity than conventional address translation can support. Recent techniques to reduce fragmentation in address translation through segmentation [6] or coalescing [7] are promising. With emerging memory technologies, novel abstractions for isolation, protection and security are needed that lend themselves well to efficient hardware support and enable a continued scaling in memory capacity in an accelerator-rich environment.

5.2 Integration of Network and Storage

Modern HPC platforms are based on commodity server components to benefit from economies of scale, and primarily differ from datacenters in that they incorporate cutting-edge network fabrics and interfaces. The canonical blade server architecture finds its roots in the desktop PC of the 80s, with the CPU (e.g., x86 sockets) managing memory at hardware speed and the OS (e.g., Linux) moving data between the memory and storage/network over legacy I/O interfaces (e.g., PCIe) in software.

Unfortunately, the legacy OS abstractions for communicating with the network interface and storage are a bottleneck in today's systems and will be a fundamental challenge in future HPC platforms. Because the network/storage controllers cannot access the host memory directly, recent years have seen a plethora of technologies that integrate private memory and logic closer to network/storage controllers to add intelligence to services, but these result in a major fragmentation of silicon in the platform across I/O interfaces and do not fundamentally address the legacy interface bottleneck. For example, the latest HPC flash array controllers or network interface cards [8] integrate 32 out-of-order ARM cores that can reach tens of GB of private memory and can directly talk to PCIe-based accelerators.

The emerging consortia for new interfaces (Section 5.1) help with a closer coordination of hardware components, not just between the host CPU and accelerators but also with the network. Future interfaces will
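The address-translation limit discussed in Sect. 5.1 can be seen with simple arithmetic: the fraction of a terabyte-scale memory node that a conventional TLB can map without misses ("TLB reach") is tiny. The TLB size below is an assumed representative value, not a figure from the text.

```python
TIB = 1024 ** 4

memory = 1 * TIB          # terabyte-scale memory node
tlb_entries = 1536        # assumed representative per-core L2 TLB size
page_4k = 4 * 1024
page_2m = 2 * 1024 ** 2

# Memory that can be mapped without taking a TLB miss:
reach_4k = tlb_entries * page_4k
reach_2m = tlb_entries * page_2m

print(f"4 KiB pages: reach {reach_4k / TIB:.6%} of memory, "
      f"{memory // page_4k:,} pages in total")
print(f"2 MiB pages: reach {reach_2m / TIB:.4%} of memory")
```

Even with 2 MiB huge pages the reach stays well below one percent of a 1 TiB node, which is why segmentation [6], coalescing [7], and altogether new protection abstractions are being explored.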
6.1 Green ICT and Power Usage Effectiveness

The term “Green ICT” refers to the study and practice of environmentally sustainable computing. The 2010 estimates put ICT at 3% of the overall carbon footprint, ahead of the airline industry [1]. Modern large-scale data centres already draw multiples of tens of MWs, on par with estimates for Exascale HPC sites. Therefore, computing is among the heavy consumers of electricity and subject to sustainability considerations with high societal impact.

For the HPC sector, the key contributors to electricity consumption are the computing, communication, and storage systems and the infrastructure, including the cooling and the electrical subsystems. Power usage effectiveness (PUE) is a common metric characterizing the infrastructure overhead (i.e., electricity consumed in IT equipment as a fraction of overall electricity). Data centre designs taking sustainability into consideration [2] have reached unprecedentedly low levels of PUE. Many EU projects have examined CO2 emissions in cloud-based services [3] and approaches to optimize air cooling [4].

It is expected that the (pre-)Exascale IT equipment will use direct liquid cooling without use of air for the heat transfer [5]. Cooling with liquid temperatures above 45°C opens the possibility for “free cooling” in all European countries and avoids the energy cost of water refrigeration. Liquid cooling has already been employed in HPC since the earlier Cray machines and continues to play a key role. The CMOSAIC project [6] has demonstrated two-phase liquid cooling, previously shown for rack-, chassis- and board-level cooling, applied to 3D-stacked ICs as a way to increase thermal envelopes. The latter is of great interest especially for the end of the Moore era, where stacking is emerging as the only path forward in increasing density. Many vendors are exploring liquid immersion technologies with mineral-based oil and other materials to enable higher power envelopes.

We assert that to reach Exascale performance and beyond, an improvement must be achieved in driving the Total Power Usage Effectiveness (TUE) metric [7]. This metric highlights the energy conversion costs within the IT equipment to drive the computing elements (processor, memory, and accelerators). As a rule of thumb, in pre-Exascale servers the power conversion circuitry consumes 25% of all power delivered to a server. A facility targeting a TUE close to one will focus the power dissipation on the computing elements (processor, memory, and accelerators). The power dissipation (and therefore also the heat generation) of the CMOS computing elements is characterized by the leakage current, which doubles for every 10°C increase of the temperature [8]. Therefore the coolant temperature has an influence on the leakage current and may be used to balance the overall energy effectiveness of the data centre for the applications. We expect that the (pre-)Exascale pilot projects, in particular those funded by the EU, will address the creation and usage of management software for global energy optimization in the facility [9].

Beyond Exascale, we expect to have results from the research related to CMOS devices cooled to low temperatures [10] (down to the liquid-nitrogen scale, 77 K). The expected effect is a decrease of the leakage current and an increased conductivity of the metallic connections at lower temperatures. We suggest that an operating point on this temperature scale can be found with significantly better characteristics of the CMOS devices. Should such an operating point exist, a practical way to cool such a computational device must be found. This may be one possible way to overcome the CMOS technology challenges beyond the feature size limit of 10 nm [11]. We suggest that such research funded in Europe may yield a significant advantage for the European HPC position beyond the Horizon 2020 projects.

The electrical subsystem also plays a pivotal role in Green ICT. Google has heavily invested in renewables and announced in 2017 that their data centres will be energy neutral. However, as big consumers of electricity, HPC sites will also require a tighter integration of the electrical subsystem with both the local/global grids and the IT equipment. Modern UPS systems are primarily designed to mitigate electrical emergencies. Many researchers are exploring the use of UPS systems as energy storage to regulate load on the electrical grid, be it for economic reasons, to balance the load on the grid, or to tolerate the bursts of electricity generated from renewables. The Net-Zero data centre at HP and GreenDataNet [12] are examples of such technologies.

Among existing efforts for power management, the majority of approaches are specific to HPC centres and/or specific optimization goals and are implemented as standalone solutions. As a consequence, these existing approaches still need to be hooked up to the wide range of software components offered by academic partners, developers or vendors. According to [13], existing techniques have not been designed to exist and interact simultaneously on one site and to do so in an integrated manner. This is mainly due to a lack of application-awareness, a lack of coordinated management across different granularities, and a lack of standardised and widely accepted interfaces, with consequently limited connectivity between modules, resulting in substantially underutilized Watts and FLOPS. To overcome the above-mentioned problems, various specifications and standards are currently under development [14, 15].

6.2 Resiliency

Preserving data consistency in case of faults is an important topic in HPC. Individual hardware components can fail, causing software running on them to fail as well. System software would take down the system if it experiences an unrecoverable error, to preserve data consistency. At this point the machine (or component) must be restarted to resume the service from a well-defined state. The traditional failure recovery technique is to restart the whole user application from a user-assisted coordinated checkpoint taken at a synchronization point. The optimal checkpoint period is a function of the time/energy spent writing the checkpoint and the expected failure rate [16]. The challenge is to guess the failure rate, since this parameter is not known in general. If a failure could be predicted, a preventive action such as a checkpoint can be taken to mitigate the risk of the pending failure.

No deterministic failure prediction algorithm is known. However, collecting sensor data and applying Machine Learning (ML) to this sensor data yields good results [17]. We expect that the Exascale machine design will incorporate sufficient sensors for failure prediction and monitoring. This may be a significant challenge, as the number of components and the complexity of the architecture will increase. Therefore, the monitoring data stream will also increase, leading to a fundamental Big Data problem just to monitor a large machine. We see this monitoring problem as an opportunity for EU funding of fundamental research in ML techniques for real-time monitoring of hardware facilities in general. The problem will not yet be solved in the (pre-)Exascale machine development. Therefore, we advocate targeted funding for this research to extend beyond the Horizon 2020 projects.

The traditional failure recovery scheme with the coordinated checkpoint may be relaxed if fault-tolerant communication libraries are used [18]. In that case the checkpoints do not need to be coordinated and can be taken per node when the computation reaches a well-defined state. When a million threads are running in a single scalable application, the capability to restart only a few communicating threads after a failure is important.

Non-volatile memories may become available for the checkpoints; they are a natural place to dump the HBM contents. We expect these developments to be explored on the time scale of the (pre-)Exascale machines. It is clear that the system software will incorporate failure mitigation techniques and may provide feedback on hardware-based resiliency techniques such as ECC and Chipkill. The software-based resiliency has to be designed together with the hardware-based resiliency. Such a design is driven by the growing complexity of the machines with a variety of hardware resources, where each resource has its own failure pattern and recovery characteristics.

On that note, compiler-assisted fault tolerance may bridge the separation between hardware-only and software-only recovery techniques [19]. This includes automation for checkpoint generation with optimization of the checkpoint size [20]. More research is needed to implement these techniques for the Exascale and post-Exascale architectures with the new levels of memory hierarchy and the increased complexity of the computational resources. We see here an opportunity for EU funding beyond the Horizon 2020 projects.
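The checkpoint-period trade-off described in Sect. 6.2 (time spent writing checkpoints vs. expected failure rate [16]) is commonly approximated to first order by Young's formula, T_opt = sqrt(2 * C * MTBF). A minimal sketch with assumed values, since, as noted above, the failure rate of a real machine is not known in general:

```python
import math

def young_checkpoint_period(c_seconds: float, mtbf_seconds: float) -> float:
    """First-order optimal checkpoint interval (Young's approximation).

    c_seconds:    time to write one checkpoint
    mtbf_seconds: expected mean time between failures (usually a guess)
    """
    return math.sqrt(2.0 * c_seconds * mtbf_seconds)

# Assumed values: 5-minute checkpoints, one failure per day on average.
c = 5 * 60
mtbf = 24 * 3600
t_opt = young_checkpoint_period(c, mtbf)
print(f"checkpoint every {t_opt / 3600:.2f} hours")  # 2.00 hours
```

The formula makes the dependency explicit: a shorter mean time between failures or a cheaper checkpoint both call for more frequent checkpoints, which is why fast NVM checkpoint targets change the optimum.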
Stringent requirements on hardware consistency and failure avoidance may be relaxed if an application algorithm incorporates its own fault detection and recovery. Fault detection is an important aspect, too. Currently, applications rely on system software to detect a fault and bring down (parts of) the system to avoid data corruption. There are many application environments that adapt to varying resource availability at the service level — Cloud computing works in this way. Doing the same from within an application is much harder. Recent work on “fault-tolerant” message-passing communication moves the fault-detection burden to the library, as discussed in the previous section. Still, algorithms must be adapted to react constructively after such fault detection, either by “rolling back” to the previous state (i.e., restarting from a checkpoint) or by “going forward”, restoring the state based on algorithm knowledge. The forward action is the subject of substantial research for the (pre-)Exascale machines and typically requires algorithm redesign. For example, a possible recovery mechanism is based on iterative techniques exploited in Linear Algebra operations [21].

Algorithm-Based Fault Tolerance (ABFT) may also use fault detection and recovery from within the application. This requires appropriate data encoding, an algorithm that operates on the encoded data, and distribution of the computation steps in the algorithm among (redundant) computational units [22]. We expect these aspects to play a role with NMP. The ABFT techniques will be required when running applications on machines where the strong reliability constraint is relaxed due to subthreshold voltage settings. Computation with very low power is possible [23] and opens a range of new “killer app” opportunities. We expect that much of this research will be needed for post-Exascale machines and is therefore an opportunity for EU funding beyond the Horizon 2020 projects.

6.3 Impact of Memristive Memories on Security and Privacy

This section discusses security and privacy implications of memristive technologies, including emerging memristive non-volatile memories (NVMs). The central property that differentiates such memories from conventional SRAM and DRAM is their non-volatility; therefore, we refer to these memories as “NVMs”. We cover potential inherent security risks that arise from these emerging memory technologies and, on the positive side, security potentials in systems and applications that incorporate emerging NVMs. Further, we also consider the impact of these new memory technologies on privacy.

6.3.1 Background

The relevance of security and privacy has steadily increased over the years. This concerns everything from highly complex cyber-physical infrastructures and systems-of-systems down to small Internet of Things (IoT) devices, if they are applied in security-critical applications [24]. A number of recent successful attacks on embedded and cyber-physical systems have drawn the interest not only of scientists, designers and evaluators but also of legislators and the general public. Examples include attacks on online banking systems [25], malware, in particular ransomware [26], and spectacular cyber attacks on critical infrastructures, such as the Stuxnet attack [27], attacks on an industrial installation in a German steel works [28] and on a driving Jeep car [29], to name but a few. Meanwhile, entire botnets consisting of IoT devices exist [30]. These examples shed light on present and future threats to modern IT systems, including embedded devices, vehicles, industrial sites, public infrastructure, and HPC supercomputers. Consequently, security and privacy may determine the future market acceptance of several classes of products, especially as they are increasingly enforced by national and EU-wide legislation [31]. Security and privacy should therefore be considered together with (and in certain cases weighed against) the more traditional system attributes such as latency, throughput, energy efficiency, reliability, or cost.

Historically, the networks connecting the system with the outside world and the software running on the system’s components were considered the source of security weaknesses, giving rise to the terms “network security” and “software security” [32]. However, the system’s hardware components are increasingly shifting into the focus of attention, becoming the Achilles’ heel of systems. Researchers have been pointing to hardware-related vulnerabilities for a long time, including side channels [33], fault-injection attacks [34], counterfeiting [35], covert channels [36] and hardware Trojans [37]. Several potential weaknesses in hardware components were exposed; some of the
widely publicized examples were: counterfeit circuits in missile-defense installations in 2011 [38], potential backdoors in FPGAs (later identified as undocumented test-access circuitry [39]) in 2012 [40], and (hypothetical) stealthy manipulations in a microprocessor’s secure random number generator in 2013 [41]. Very recently, two hardware-related security breaches, Meltdown [42] and Spectre [43], were presented. They exploit advanced architectural features of modern microprocessors and affect several microprocessors that are in use today.

Meltdown and Spectre are indicative of hardware-based attacks on high-performance microprocessors: On the one hand, it is difficult for an attacker to find such weaknesses (compared to many conventional methods, from social engineering to malware and viruses), and even when the weaknesses are known it may be difficult to develop and mount concrete attacks. On the other hand, once such an attack has been found, it affects a huge population of devices. It is also extremely difficult, or may even be impossible, to counteract, because hardware cannot easily be patched or updated in the field. Corrective actions, which require the replacement of the affected hardware components by (yet to be produced) secure versions, are usually extremely costly and may even be infeasible in practice. Healing the problem by patching the software that runs on the component is not always effective and is often associated with a barely acceptable performance penalty [42]. Consequently, new architectural features should undergo a thorough security analysis before being used.

In this section, we consider potential implications of emerging memristors, and in particular memristive non-volatile memories (NVMs) and NVM-based computer architectures, on the security and privacy of systems (compared to conventional memory architectures). We will discuss both the vulnerabilities of systems due to the integration of emerging NVMs, and the potential of NVMs to provide new security functions and features.

6.3.2 Memristors and Emerging NVMs: New Vulnerabilities

Unlike conventional SRAM and DRAM, memristive NVMs retain their content without power supply. The first obvious consequence is the persistency of attacks: if the adversary has managed to place malicious content (e.g., software code or manipulated parameter values) into a device’s main memory, this content will not disappear by rebooting the device or powering it off. (Of course, to get rid of the malware, additional security measures are usually necessary.) This is in stark contrast to volatile memories, where reboot and power-off are viable ways to “heal” at least the volatile memory of an attacked system; the same system with an NVM will stay infected.

The non-volatility can also simplify read-out attacks on unencrypted memory content. In such attacks, sensitive data within an electronic component are accessed by an adversary with physical access to the device, using either direct read-out or side channels, e.g., measuring data-dependent power consumption or electromagnetic emanations. Usually, volatile memory must be read out in the running system, at full system speed; moreover, the system may be equipped with countermeasures, e.g., tamper detectors which delete the memory content once they identify an attempted attack. An exception is so-called cold-boot attacks, where the memory content may persist for several minutes or even hours [44]. An attacker who has powered off a system with sensitive data in an NVM can analyze the NVM block offline.

It is currently not clear whether emerging memristive NVMs bear new side-channel vulnerabilities. For example, many security architectures are based on encrypting sensitive information and overwriting the original data in memory with an encrypted version or randomness. It is presently not clear whether memristive elements within NVMs exhibit a certain extent of “hysteresis”, which may allow the adversary to reconstruct, with some degree of accuracy, the state which a memory cell had before the last write operation. This property was discussed in [45] from the forensic point of view. Whether this vulnerability indeed exists must be established for each individual NVM technology (such as STT-RAM or ReRAM) by physical experiments. If it exists, it might allow or at least support side-channel attacks.
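The standard mitigation for the offline read-out risk is to ensure that plaintext never reaches the non-volatile array: data are encrypted before every write, so a powered-off NVM yields only ciphertext. The toy model below illustrates the idea only; the SHA-256-based keystream, key and block names are our own illustrative assumptions — a real design would use a vetted cipher (e.g., AES) with fresh per-write nonces:

```python
import hashlib

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy counter-mode keystream from SHA-256 -- for illustration only."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xor(data: bytes, ks: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, ks))

# The NVM keeps its content across power cycles; model it as a plain dict
# that an attacker can dump offline after powering the device off.
nvm = {}
key, nonce = b"device-unique-key", b"boot-0001"  # illustrative values

secret = b"account=42;pin=9999"
nvm["block0"] = xor(secret, keystream(key, nonce, len(secret)))  # encrypt-then-write

# An offline dump of the powered-off NVM reveals only ciphertext ...
assert nvm["block0"] != secret
# ... while the legitimate system, holding the key, recovers the data.
assert xor(nvm["block0"], keystream(key, nonce, len(nvm["block0"]))) == secret
```

The hysteresis question raised above matters precisely here: if a cell’s previous state can be partially reconstructed, an attacker may learn the plaintext that was overwritten by the ciphertext, bypassing the encryption entirely.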
In [46], experiments were not conducted, but the authors announce future experiments. To our knowledge, typical side-channel attacks (power, timing, cache, etc.) have not been considered so far in the context of NVMs.

Some of the memristive NVMs are also prone to active manipulations, enabling fault attacks. For example, the recent paper [47] considers non-invasive magnetic-field attacks on STT-RAMs, where the adversary overrides the values of the cells by applying either a static or an alternating magnetic field. The authors of [47] note that this attack can be mounted on a running system or in passive mode, where it could, e.g., compromise the boot process.

While all of the mentioned attack scenarios can have severe consequences already when the adversary merely has physical access to the end product, they may be even more dangerous if an attacker manages to compromise the system design and manufacturing process and is able to insert a hardware Trojan into the circuitry. Trojans can be inserted during semiconductor manufacturing [48], they can be lurking in third-party intellectual-property cores [49], and even the CAD tools used for circuit design may plant Trojans [50]. Emerging NVMs might facilitate both the establishment of Trojans in the system (e.g., by placing their trigger sequences in a non-volatile instruction cache) and also multiply the damaging potential of Trojans.

6.3.3 Memristors and Emerging NVMs: Supporting Security

On the positive side, memristors can be the basis for security primitives that are difficult or expensive to realize technically with conventional hardware and software. Depending on the scenario, one such primitive might be a random number generator (RNG), which is useful, for instance, for on-chip generation of secure cryptographic keys, signature parameters and nonces, and for creating masks to protect cryptographic cores against side-channel analysis. Roughly speaking, RNGs can be divided into deterministic RNGs (DRNGs, a.k.a. pseudorandom number generators) and true RNGs. The class of true RNGs can further be subdivided into physical RNGs (PTRNGs, using dedicated hardware) and non-physical true RNGs (NPTRNGs) [51]. Memristors, and NVMs based on them, might be beneficial for both DRNGs and true RNGs. For DRNGs, NVMs might be used to store the internal state, thus reducing the need for additional non-volatile memory and saving either the copy process of the internal state to non-volatile memory or the reseeding upon each power-on. Of course, such NVM cells must be secure against read-out and manipulation, since otherwise an attacker might be able to predict all future random numbers. In true RNGs, memristors might serve as sources of entropy (see, e.g., [52] and [53]), providing sources for physical RNGs or for non-physical non-deterministic RNGs such as Linux /dev/random. Whether this use is realistic depends on the outcome of physical experiments for individual memristive technologies. To this end, suitable random parameters (e.g., the duration of the transition between stable states) must be identified; then a stochastic model (for PTRNGs), or at least a reliable lower entropy bound per random bit (for NPTRNGs), must be established and validated, and finally the entropy per bit must be estimated [54]; see also [55, 56, 57]. In [52] and [53], the authors focus only on the statistical properties of the generated random numbers, which are verified by NIST randomness tests.

Another possible memristor-enabled security primitive could be a Physically Unclonable Function (PUF). A PUF is a “fingerprint” of an individual circuit instance among a population of manufactured circuits [58]. It should reliably generate a unique, circuit-specific bitstring, and it should be impossible to produce another circuit with the same fingerprint. PUFs are used, for instance, for on-chip generation of secret keys and for authentication protocols, but also for tracking circuits and preventing their counterfeiting [59]. PUFs based on memory cells are well known [60], and these insights can perhaps be applied directly to emerging NVMs [61]. However, the emerging near-memory and in-memory concepts, where NVMs are tightly coupled with logic, create the potential for richer varieties of PUF behavior, such as “strong PUFs” which support challenge-response authentication protocols [62]. A strong PUF based on memristive elements has been proposed in [63]. Moreover, it has been suggested to leverage the non-linearity of memristors to define “public PUFs” which overcome certain deficiencies of traditional PUFs [64].

An interesting question is whether emerging memristive cells and NVM-enabled architectures are better or worse protected against counterfeiting and reverse engineering than conventional circuits. On the one hand, the designer can replace identifiable circuit structures by a regular fabric, similar to reconfigurable gate arrays, that is controlled by values stored in an NVM. This makes it difficult for an
attacker to apply the usual flow to reconstruct the circuit functionality: depackage the circuit, extract its individual layers, and apply optical recognition to find logic gates, memory cells, interconnects, and other structures. In fact, if the content of the “configuration” NVM cells is lost during deprocessing, the functionality is irretrievably lost as well. Possibly, attackers may find ways to read out the values in memristive elements prior to deprocessing. In addition, memristors can power anti-counterfeiting solutions, like PUFs. As with other security attributes, the resistance of circuits to reverse engineering is a cat-and-mouse game in which the defender invents new protections and the attacker finds ways around them; NVMs could substantially change the rules of this game.

6.3.4 Memristors, Emerging NVMs and Privacy

Privacy stands in a non-trivial relationship with security, and therefore the security implications of memristors can have positive or negative consequences for privacy [65]. On the one hand, security breaches that lead to unauthorized access to user data (e.g., leaked secret keys), or that compromise their authenticity and integrity, are clearly detrimental to privacy (loss of privacy or of availability). To this end, all properties of NVMs that simplify attacks on encryption schemes (e.g., read-out attacks or new side-channel attacks) have a negative privacy impact, and all beneficial features of NVMs, e.g., the generation of secure secret keys, have positive consequences. Here, security and privacy requirements are consistent.

Security and privacy may come into conflict when it comes to methods which track individual circuit instances in an undesired and unnecessary way, e.g., by storing a unique identifier in an on-chip NVM or by creating such an identifier using a PUF. This functionality is beneficial for security, and in particular to prevent counterfeiting or overbuilding [59].

References

[1] L. Smarr. “Project GreenLight: Optimizing Cyber-infrastructure for a Carbon-Constrained World”. In: Computer 43.1 (Jan. 2010), pp. 22–27. doi: 10.1109/MC.2010.20.
[2] J. Shuja, A. Gani, S. Shamshirband, R. W. Ahmad, and K. Bilal. “Sustainable Cloud Data Centers: A survey of enabling techniques and technologies”. In: Renewable and Sustainable Energy Reviews 62.C (2016), pp. 195–214.
[3] The ECO2Clouds Project. 2017. url: http://eco2clouds.eu/.
[4] The CoolEmAll Project. 2017. url: https://www.hlrs.de/about-us/research/past-projects/coolemall/.
[5] The DEEP-ER Project. 2017. url: http://juser.fz-juelich.de/record/202677.
[6] The CMOSAIC Project. 2017. url: http://esl.epfl.ch/page-42448-en.html.
[7] TUE and iTUE. 2017. url: http://eehpcwg.lbl.gov/sub-groups/infrastructure/tue-team.
[8] D. Wolpert and P. Ampadu. Managing Temperature Effects in Nanoscale Adaptive Systems. 2012. 174 pp. doi: 10.1007/978-1-4614-0748-5.
[9] S. Li, H. Le, N. Pham, J. Heo, and T. Abdelzaher. “Joint Optimization of Computing and Cooling Energy: Analytic Model and a Machine Room Case Study”. In: 2012 IEEE 32nd International Conference on Distributed Computing Systems. 2012, pp. 396–405. doi: 10.1109/ICDCS.2012.64.
[10] M. J. Ellsworth. The Challenge of Operating Computers at Ultra-low Temperatures. 2001. url: https://www.electronics-cooling.com/2001/08/the-challenge-of-operating-computers-at-ultra-low-%20temperatures/.
[11] J. Hu. Low Temperature Effects on CMOS Circuits. 2017. url: http://users.eecs.northwestern.edu/~jhu304/files/lowtemp.pdf.
[12] The GreenDataNet Project. 2017. url: http://www.greendatanet-project.eu/.
[13] “A Strawman for an HPC PowerStack”. In: OSTI Technical Report (Aug. 2018). url: https://hpcpowerstack.github.io/strawman.pdf.
[14] url: https://hpcpowerstack.github.io/.
[15] url: https://www.osti.gov/biblio/1458125.
[16] J. S. Plank and M. G. Thomason. “Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems”. In: J. Parallel Distrib. Comput. 61.11 (Nov. 2001), pp. 1570–1590. doi: 10.1006/jpdc.2001.1757.
[17] D. Turnbull and N. Alldrin. Failure prediction in hardware systems. Tech. rep. University of California, San Diego, 2003. url: http://cseweb.ucsd.edu/~dturnbul/Papers/ServerPrediction.pdf.
[18] G. E. Fagg and J. J. Dongarra. “FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World”. In: Recent Advances in Parallel Virtual Machine and Message Passing Interface: 7th European PVM/MPI Users’ Group Meeting. 2000, pp. 346–353. doi: 10.1007/3-540-45255-9_47.
[19] T. Herault and Y. Robert. Fault-Tolerance Techniques for High-Performance Computing. 2015. 320 pp. doi: 10.1007/978-3-319-20943-2.
[20] J. S. Plank, M. Beck, and G. Kingsley. “Compiler-Assisted Memory Exclusion for Fast Checkpointing”. In: IEEE Technical Committee on Operating Systems and Application Environments 7 (1995), pp. 62–67.
[21] J. Langou, Z. Chen, G. Bosilca, and J. Dongarra. “Recovery patterns for iterative methods in a parallel unstable environment”. In: SIAM Journal on Scientific Computing 30.1 (2007), pp. 102–116.
[32] C. Eckert. IT-Sicherheit – Konzepte, Verfahren, Protokolle. 6th ed. 2009.
[33] S. Mangard, E. Oswald, and T. Popp. Power Analysis Attacks – Revealing the Secrets of Smart Cards. 2007.
[34] A. Barenghi, L. Breveglieri, I. Koren, and D. Naccache. “Fault injection attacks on cryptographic devices: Theory, practice, and countermeasures”. In: Proceedings of the IEEE 100.11 (2012), pp. 3056–3076.
[35] F. Koushanfar, S. Fazzari, C. McCants, W. Bryson, M. Sale, and M. P. P. Song. “Can EDA combat the rise of electronic counterfeiting?” In: DAC Design Automation Conference 2012. 2012, pp. 133–138.
[36] Z. Wang and R. B. Lee. “Covert and side channels due to processor architecture”. In: ACSAC. 2006, pp. 473–482.
[37] S. Bhunia, M. S. Hsiao, M. Banga, and S. Narasimhan. “Hardware Trojan attacks: Threat analysis and countermeasures”. In: Proceedings of the IEEE 102.8 (2014), pp. 1229–1247.
[38] D. Lim. Counterfeit Chips Plague U.S. Missile Defense. Wired. 2011. url: https://www.wired.com/2011/11/counterfeit-missile-defense/.
[39] S. Skorobogatov. Latest news on my Hardware Security Research. url: http://www.cl.cam.ac.uk/~sps32/sec_news.html.
[50] M. Potkonjak. “Synthesis of trustable ICs using untrusted CAD tools”. In: Design Automation Conference. 2010, pp. 633–634.
[51] W. Schindler. “Random number generators for cryptographic applications”. In: Cryptographic Engineering. 2009, pp. 5–23.
[52] C. Huang, W. Shen, Y. Tseng, C. King, and Y. C. Lin. “A Contact-resistive random-access-memory-based true random number generator”. In: IEEE Electron Device Letters 33.8 (2012), pp. 1108–1110.
[53] H. Jiang et al. “A novel true random number generator based on a stochastic diffusive memristor”. In: Nature Communications 8 (2017). doi: 10.1038/s41467-017-00869-x.
[54] W. Schindler. “Evaluation criteria for physical random number generators”. In: Cryptographic Engineering. 2009, pp. 25–54.
[55] AIS 20: Funktionalitätsklassen und Evaluationsmethodologie für deterministische Zufallszahlengeneratoren. Version 3. May 2013. url: https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Zertifizierung/Interpretationen/AIS_20_pdf.pdf.
[56] AIS 31: Funktionalitätsklassen und Evaluationsmethodologie für physikalische Zufallszahlengeneratoren. Version 3. May 2013. url: https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Zertifizierung/Interpretationen/AIS_31_pdf.pdf.