Received June 30, 2021; accepted August 25, 2021; date of publication August 27, 2021; date of current version September 7, 2021.


Digital Object Identifier 10.1109/ACCESS.2021.3108475

Developing a Multicore Platform Utilizing Open RISC-V Cores
HYEONGUK JANG 1,2, KYUSEUNG HAN 1 (Member, IEEE), SUKHO LEE 1, JAE-JIN LEE 1,2,
SEUNG-YEONG LEE 3 (Student Member, IEEE), JAE-HYOUNG LEE 3 (Student Member, IEEE),
AND WOOJOO LEE 3 (Member, IEEE)
1 Electronics and Telecommunications Research Institute, Daejeon 34129, South Korea
2 Department of ICT, University of Science and Technology, Daejeon 34113, South Korea
3 School of Electrical and Electronics Engineering, Chung-Ang University, Seoul 06974, South Korea

Corresponding author: Woojoo Lee (space@cau.ac.kr)


This work was supported in part (50%) by the Ministry of Science and ICT (MSIT), South Korea, through the Development of Ultra-Low
Power Intelligent Edge SoC Technology Based on Lightweight RISC-V Processor, supervised by the Institute for Information and
Communications Technology Planning and Evaluation (IITP) under Grant 2018-0-00197, and in part (50%) by the Chung-Ang University
Research Grants in 2019.

ABSTRACT RISC-V has been experiencing explosive growth since its first appearance in 2011. Dozens of free and open cores based on this instruction set architecture have been released, and RISC-V-based devices optimized for specific applications, such as IoT and wearable devices, embedded systems, AI, and virtual/augmented reality, are emerging. As RISC-V cores are used in more and more fields, the demand for multicore platforms composed of RISC-V cores is also rapidly increasing. Although various RISC-V cores have been developed for specific applications, and it therefore seems possible to pick them up and create the multicore best optimized for a target application, this is unfortunately very difficult to realize in practice. The main reason is that most open cores are released as single cores without cache coherence logic, and adding that logic requires expensive design effort and development cost. To tackle this issue, this paper proposes a method that solves the cache coherence problem without additional effort from the developer and maximizes the performance of a multicore composed of the RISC-V cores selected by the developer. Along with a description of the sophisticated operating mechanisms of the proposed method, this paper details its architecture and hardware implementation. Experiments conducted through the prototype development of a RISC-V multicore platform incorporating the proposed architecture, together with the development of an application running on the platform, demonstrate the effectiveness of the proposed method.

INDEX TERMS Multicore platform, RISC-V, system-on-chip (SoC), electronic design automation (EDA).

The associate editor coordinating the review of this manuscript and approving it for publication was Songwen Pei.

I. INTRODUCTION
Instruction set architecture (ISA) is the essential vocabulary that allows hardware and software to communicate [1]. Over the past two decades, two major companies, ARM and Intel, have dominated ISAs, and as a result, their microprocessors are now embedded in all computing devices, from the smallest to the fastest. However, after the recent rise of the RISC-V ISA [2], all of this is changing, and the microprocessor industry is turning upside down [3]. RISC-V is a free and open instruction set with well-structured modularity, providing a very high level of flexibility at a very low cost and allowing users to produce custom chips suited to specific applications. Just as Linux gained popularity and acclaim among operating systems, RISC-V aims to become the Linux of processors [4], and it is beginning to be used in various commercial products one after another.

As RISC-V is expected to be used in the design of the new and more specialized processor cores that will soon emerge in wearables, home appliances, robots, autonomous vehicles, and factory equipment, the need for RISC-V-based multicore platforms is becoming increasingly urgent. Various types of RISC-V cores have already been released, and by using them it is ideally possible to configure customized multicores for various applications. However, in terms of practicality, building a multicore platform using the open RISC-V cores, especially those released as single cores, unfortunately poses enormous challenges.

This is mainly due to the cache coherence problem. Without solving this problem, the correct operation of the multicore is not guaranteed, or the expected performance is not achieved. The currently available way to solve this problem is for the platform developers themselves to implement the cache coherence logic (CCL) for each multicore, but this has the critical limitation that the design effort and development costs are very expensive and sometimes prohibitive.

To address the difficulty of developing multicore platforms utilizing RISC-V cores, we propose a method that solves the cache coherence problem without a CCL and maximizes the performance of the multicore regardless of which RISC-V cores are used in the multicore platform. The main idea of the proposed method is to avoid the cache coherence problem by disallowing caching of shared data by default, and to compensate for the resulting performance degradation by temporarily allowing caching of data that are obviously not shared for a certain period of time. We leave it to the programmer to determine which data are temporarily cached, but we provide simple application programming interface (API) functions to make it easy for programmers to apply this method when developing applications. In addition, we analyze and identify the problems that may arise when the proposed method is applied to existing cache structures, and devise sophisticated operating mechanisms to address them. Next, we develop the architecture that realizes the proposed method in RISC-V core-based multicore platforms, and implement the necessary hardware. We then build the proposed architecture into the network-on-chip (NoC) responsible for IP-to-IP communication in a system-on-chip (SoC), so that the proposed method can be applied no matter which RISC-V cores the developer selects for the multicore. Moreover, by including the proposed architecture in a RISC-V-based SoC automatic design tool, we increase the usability of the proposed method. Finally, to verify the effectiveness of the proposed method, we implement a RISC-V multicore prototype platform including the proposed architecture on an FPGA, and develop a camera-input-based handwriting recognition program as an application. Through experimental work based on this application running on the FPGA, we confirm that the platform to which the proposed method is applied shows a performance improvement of about 37% over the one to which it is not.

The main contributions of this paper may be summarized as follows:
– As the most practical solution to the cache coherence problem in multicore development with RISC-V cores, a temporary caching (TC) method is presented.
– Along with the sophisticated operation mechanism of the proposed TC, a detailed description of the hardware and software development for TC is provided.
– Through prototyping of a RISC-V-based multicore platform to which TC is applied and the development of an application running on this platform, the effectiveness of the proposed solution is verified.

The remainder of this paper is organized as follows. Section II elucidates the cache coherence problem that can occur when a multicore is configured from RISC-V cores, and the existing solutions associated with it. Section III introduces the main idea of the proposed method and discusses the problems that may arise from it. Next, in Section IV, a detailed description of the architecture and hardware implementation of the proposed method is presented. Implementing the proposed architecture in a NoC and enabling it to be generated automatically are covered in Section V. Section VI develops a prototype RISC-V multicore platform and applications as testbenches, and provides the experimental results obtained from them. Finally, Section VII concludes the paper.

II. CACHE COHERENCY PROBLEM IN RISC-V MULTICORE
The cache coherency problem is a well-known problem in multicores due to the caches being distributed across the individual cores. Since each core has its own cache, the copy of shared data in that cache may not always be the most up-to-date version, resulting in data synchronization failures that can crash the program or the entire computer. FIGURE 1 shows a simple example of the cache coherency problem. In the figure, there is a dual-core processor with Core1 and Core2, where each core has brought a memory block holding the variable A into its private cache. Core2 then writes 0 to A. When Core1 attempts to read A from its cache, it will not have the latest version, producing incorrect results.

FIGURE 1. Example of the cache coherence problem.
FIGURE 2. Solutions to the cache coherence problem: (a) simply non-caching the shared data and (b) using CCL.
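The scenario of FIGURE 1 can be written out as a small C sketch. The code below is only a conceptual illustration of the problem (the variable name and the per-core task functions are ours, not part of any platform discussed in this paper); it assumes each task runs pinned to its own core with a private, non-coherent write-back cache.

    /* Conceptual illustration of the FIGURE 1 scenario (illustrative only). */
    volatile int A = 1;      /* shared variable, brought into both private caches */

    void core1_task(void)
    {
        int r = A;           /* loads the cache line holding A into Core1's cache */
        /* ... Core2 writes 0 to A in the meantime ... */
        r = A;               /* without cache coherence logic, this read may still
                                return the stale value 1 from Core1's cache        */
        (void)r;
    }

    void core2_task(void)
    {
        A = 0;               /* updates Core2's cached copy; Core1's copy and main
                                memory are not synchronized without a CCL          */
    }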


TABLE 1. Features of the existing RISC-V cores.
FIGURE 3. Example of (a) conceptual diagram of temporary caching in parallel processing and (b) its code.

To tackle the cache coherence problem in multicore, tremendous research efforts have been continuing for
decades, and various solutions have been proposed. These
solutions can be divided into software-based and hardware-
based schemes. The software-based schemes refer to
approaches of caching and maintaining data coherency in
software by analyzing shared data [5]–[9]. These schemes
mainly solve the cache coherence problem by improving
the compiler, and sometimes by requiring special hardware
assist. Unfortunately, however, a compiler that completely
solves the cache coherence problem has not yet appeared on
the market [10].
The software-based scheme that can be used in practice
is to allocate all the shared data used by multiple cores
into a non-cacheable region at compile time. Then the cores
read the shared data directly from the main memory without
caching, which is described in FIGURE 2 (a). This scheme
has advantages in terms of practicality because the system
developer does not need to modify the existing compiler, and
it has the advantage in terms of programmability because
the program works correctly even if the software developer
does not consider the cache. Of course, the speed at which
the core accesses shared data is slowed, so performance
degradation is unavoidable with this scheme. In particular,
in the case of applications that have a lot of data access,
such as image processing, performance can be greatly
degraded.
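In practice, this scheme amounts to forcing shared objects into an uncached region of the memory map. A minimal sketch is shown below; it assumes a GCC-style toolchain and a linker script that maps a section named .noncacheable to the non-cacheable region, both of which are illustrative assumptions rather than details taken from this paper.

    /* Sketch of the practical software-based scheme: shared data are placed in a
     * region that the cores never cache. The section name is an assumption; the
     * actual region is defined by the platform's linker script and memory map. */
    #define NONCACHEABLE __attribute__((section(".noncacheable")))

    NONCACHEABLE int shared_flag;        /* always accessed directly in main memory */
    NONCACHEABLE int shared_buf[1024];   /* coherent by construction, but every
                                            access pays the main-memory latency     */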
Due to the shortcomings of software-based schemes
in terms of performance, hardware-based schemes are
widely used in typical multicore systems. By utilizing addi-
tional hardware to synchronize the data in the caches,
which is called CCL (cache coherence logic) as shown
in FIGURE 2 (b), the schemes achieve the high performance
of multicore platforms [10]–[13]. However, since the CCL is
closely related to the cache structure, adding the CCL to an
already-designed open-source core is very expensive in terms of design effort and development cost, unless the CCL is considered and designed together with the core from the beginning. Moreover, it is very difficult and impractical to implement a CCL that targets several different cores rather than one kind of core.

Based on the above discussion, it may be very hard to solve the cache coherence problem by using a hardware-based scheme to develop a multicore platform with RISC-V cores. To examine the development of a multicore platform using RISC-V cores more realistically, TABLE 1 lists the existing RISC-V cores. As shown in the table, most of the RISC-V cores have been released in the form of a single core without CCL. In order to configure a multicore with them while using a hardware-based scheme, platform developers have no choice but to implement the CCL by themselves. In the worst case, some cores do not provide readable RTL code, making it impossible to add the CCL.


In addition, some RISC-V cores have been released as multicores with CCL support, but these cores have a big limitation in scalability because they have a fixed number of cores. Furthermore, there may be cases where platform developers want to use existing RISC-V cores to construct heterogeneous multicores; even in these cases, developers are still faced with the problem of having to implement their own CCL. After all, platform developers want to choose the most suitable RISC-V cores from the list, but the reality is that they will have a hard time building a multicore platform that scales, regardless of the cache coherence type.

III. TEMPORARY CACHING


A. MAIN IDEA
In this paper, as the most practical solution to develop mul-
ticore with RISC-V cores, we propose a new software-based
scheme that can compensate for the performance degradation
of the conventional software-based schemes without devel-
oping a new compiler or losing programmability. FIGURE 3
shows the motivation and main idea of the proposed scheme.
In software-based schemes, when a program has a consecutive array and the array is shared data that can cause the cache coherence problem, caching that array is strictly not allowed. However, as shown in the figure, if the array can be split into multiple pieces within a loop statement and those pieces can be processed independently on each core, performance can be improved if the programmer is temporarily allowed to cache the array during the loop. Of course, after the loop is over, caching of that array should be disabled again.

FIGURE 4. Memory access over time on Core1 for array A, when TC is (a) not applied and (b) applied.

In other words, the main idea of the proposed scheme is to allow the programmer to temporarily cache data when possible; we call this technique TC (temporary caching). The proposed TC improves performance by making it possible to dynamically cache data that originally had to be accessed from main memory. For example, if array A is shared data as shown in FIGURE 4 (a), it must be accessed directly from main memory. However, if A is accessed only by Core1 during a certain period, the programmer can allow A to be cached for that period. FIGURE 4 (b) conceptually shows that the application of TC reduces the memory access time. In applications such as deep neural network operations and image processing, where there is a lot of memory access and TC can be applied frequently, the benefits of TC can be significant.

In order for programmers to apply TC easily, we develop an API with a tc_malloc function to start TC and a tc_free function to end TC. FIGURE 5 shows an example of how to apply TC to the original program code using the provided API. In the original code, shown in FIGURE 5 (a), there is a variable a. When TC can be applied to the variable a in the program, as shown in the 5th line of FIGURE 5 (b), the programmer calls tc_malloc with the start address and size of a as parameters. The starting address of a new variable x, which can be cached while holding the same data as variable a, is then returned. In more detail, as shown in FIGURE 6, tc_malloc creates x with the same size as a in a space called the TC heap in the cacheable region. For reference, the TC heap is reserved from address space in the cacheable region of the memory map that is not actually mapped to memory or MMIO, and it is not the space used for instruction data and read-only data. Then, tc_malloc dynamically sets the memory map so that this x is mapped to the main memory space mapped to a.

FIGURE 5. Programming example.
FIGURE 6. Mapping mechanism between the memory map and main memory for TC.


After that, instead of using a, x is used in the code, allowing the core to fetch the same data as a from main memory, put it in the cache, and use it. Finally, when it is no longer necessary or possible to cache the data, the programmer calls tc_free with the starting address of x as a parameter; tc_free then flushes the cached TC data from the cache, frees the space allocated for x in the TC heap, and removes the mapping to the corresponding main memory.
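Putting the two calls together, a typical use of the API looks like the sketch below. The exact prototypes of tc_malloc and tc_free are not listed in this paper, so the signatures and the splitting of the array across cores are assumptions made for illustration.

    #include <stddef.h>

    /* Assumed prototypes for the TC API described in the text. */
    void *tc_malloc(void *orig_addr, size_t size);
    void  tc_free(void *tc_addr);

    #define N 1024
    int a[N];                          /* shared array, non-cacheable by default */

    void worker(int core_id, int num_cores)
    {
        /* Map the part of a[] that only this core touches to a cacheable
         * virtual address in the TC heap. */
        int chunk = N / num_cores;
        int *x = (int *)tc_malloc(&a[core_id * chunk], chunk * sizeof(int));

        for (int i = 0; i < chunk; i++)
            x[i] *= 2;                 /* cached accesses instead of main-memory accesses */

        tc_free(x);                    /* flush the cached TC data and drop the mapping */
    }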
Meanwhile, when applying TC, programmers must take into account real-time computing, reliability, and the power/energy consumption resulting from cache usage. More precisely, the use of the cache for shared data can cause cache interference issues between tasks, which can significantly hamper the predictability and analysis of multicore real-time systems [14], [15]. Recent studies on cache architecture and cache coherence show that they have a significant impact on system reliability [16], [17]. Additionally, cache architecture and operational policies are well known to have a significant impact on overall system power and energy, so optimizing them has been intensively studied for more than a decade [18]–[20]. Furthermore, as the power density of chips increases, thermal design power (TDP) has become an important concern in modern chip designs [21], [22], and some studies have pointed out that the leakage current of the cache significantly affects the TDP of the overall system [23], [24]. After all, when applying TC, the programmer must optimize the target application with these factors in mind. Fortunately, TC is a software-based scheme, so programmers can easily do this by trial and error using the provided TC API. In addition, compared to the large power overhead of the CCL [25], [26], which may adversely affect the TDP, TC is advantageous for TDP because the CCL is not required.

B. LIMITATION DUE TO THE CACHE-LINE PROBLEM
In a function, shared data can be divided into N pieces of short-term private data and one piece of short-term shared data, where N is the number of cores. Each piece of private data can be temporarily cached by its core; we refer to these data collectively as TC data, and define TCx data by attaching the index x of the corresponding core to each. The short-term shared data still accessed from main memory is called non-TC data. Then, noting that data transfer between the cache and main memory is basically done in cache-line units, we can see that a fatal problem can occur if two or more kinds of TCx data or non-TC data lie on the same cache line. We call this problem the cache-line problem and analyze it in detail below.

First, when TC1 data is transferred to the cache, non-TC data located near the TC1 data can also be cached, resulting in the cache-line problem. In this case, if the core attempts to access the non-TC data, the data will be accessed from the cache and not from main memory. FIGURE 7 (a) shows a detailed example of this problem, which can eventually cause critical system errors.

FIGURE 7. Problematic cases when adopting TC.

Next, when two different TC data belong to the same cache line, another cache-line problem can occur. FIGURE 7 (b) describes this case, where variables A and B are TC1 and TC2 data, respectively. If A and/or B are modified and written back to main memory, a wrong value can be stored in main memory due to the unintentionally-cached data in each cache. This can also cause a fatal system error, and there is no existing solution to prevent it.

The cache-line problem puts a big limit on the use of TC. For example, as shown in the 1st line of code in FIGURE 8 (a), array a is shared, so on a dual-core platform without CCL it basically should not be cached. On the other hand, as seen in the next for-loop in both codes, a is actually used independently by each core, whereby a[0∼49] and a[50∼99] are processed on each core. Therefore, to improve performance, it is desirable to apply TC to a in each code, as described in FIGURE 8 (b). However, since there is a high possibility that some cache lines of a[0∼49] and a[50∼99] overlap, the programmer must never apply TC as in the example in FIGURE 8 (b). In other words, the programmer must conservatively apply TC only to data that clearly do not share a cache line, which is a huge constraint on the use of TC.

FIGURE 8. Example codes for the false sharing problem.
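A programmer therefore has to reason about cache-line boundaries before applying TC. One conservative way to express the check is sketched below; the helper is not part of the proposed API, and the 32-byte line size is an assumption for illustration.

    #include <stdint.h>
    #include <stddef.h>

    #define CACHE_LINE 32u   /* assumed cache-line size in bytes */

    /* Returns 1 if the two byte ranges touch a common cache line, i.e., caching
     * them independently on different cores would be unsafe without the hardware
     * support introduced in Section IV. */
    static int share_cache_line(const void *p, size_t p_size,
                                const void *q, size_t q_size)
    {
        uintptr_t p_first = (uintptr_t)p / CACHE_LINE;
        uintptr_t p_last  = ((uintptr_t)p + p_size - 1) / CACHE_LINE;
        uintptr_t q_first = (uintptr_t)q / CACHE_LINE;
        uintptr_t q_last  = ((uintptr_t)q + q_size - 1) / CACHE_LINE;

        return !(p_last < q_first || q_last < p_first);
    }

For the array in FIGURE 8, share_cache_line(&a[0], 50 * sizeof(int), &a[50], 50 * sizeof(int)) almost always returns 1, because a[49] and a[50] normally sit in the same cache line; this is exactly the case in which TC must not be applied naively.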


C. PLAUSIBLE SOLUTIONS
To overcome this limitation on the use of TC, one can come up with a method of using a lock mechanism together with TC. For example, as shown in the 5th and 12th lines in FIGURE 8 (c), a programmer codes lock and unlock operations before tc_malloc and after tc_free, respectively, so that a[0∼49] and a[50∼99] are cached, updated, and flushed independently. In this way, performance is improved in terms of data access speed due to TC, but since the programs are serialized by the lock, significant performance loss may occur in terms of data-parallel processing, which may lead to overall performance degradation.

Data allocation by the compiler is also a plausible approach to consider, but in the end it is not appropriate. For example, one might think that inserting proper padding between a[49] and a[50] would avoid the conflict, but this is only possible if the addressing of the array elements is linear, which is not the case in reality. On the other hand, allocating one element per cache line at compile time will definitely prevent the two TC groups from mixing in one cache line. However, this approach not only increases memory usage extremely, but also significantly reduces performance by removing the spatial locality of the cache.

We may also consider fundamentally preventing two or more TCx data and/or non-TC data from belonging to one cache line by copying each TC data to new data and using the copy. However, this method violates the aim of TC to improve performance, as it incurs memory resource waste and significant time overhead for copying the data. Meanwhile, instead of the software-style approaches discussed above, we may think of a hardware-based solution that supports variables whose addresses are unaligned to the cache-line size. This method is ideally possible, but none of the existing core architectures, including the RISC-V cores, support this structure.

After all, all of the above solutions have fatal weaknesses. In particular, the problem is exacerbated by the inability to modify the RISC-V cores themselves. Under the conclusion that a solution based on software or hardware alone is difficult, we try to solve the cache-line problem through an approach that considers both software and hardware. Furthermore, to find the most practical solution, we considered the following issue in the software/hardware co-design approach: no matter how easily the developed software is available on the platform, if it is difficult to configure the platform with the necessary hardware alongside the software, it cannot be a practical solution. In the following sections, we introduce our software/hardware co-design solution in detail and explain how to make this solution the most practical by implementing a way to automatically generate a multicore platform with the proposed hardware.

IV. TEMPORARY CACHING ARCHITECTURE
A. OVERVIEW
The cache-line problem can be solved by making the addresses of TC and non-TC data non-consecutive. To do that, we introduce the concept of virtual addresses. The use of virtual addresses has the same effect as the copy-based solution discussed in Section III-C, which copies TC data to a new memory location, but the copy overhead can be avoided by mapping a virtual address to the original TC data. To realize this concept, it is necessary to develop system software that allocates the virtual addresses and hardware that supports the address translation, which is one of the major topics of this section.

Unfortunately, the introduction of virtual addresses alone cannot solve the cache-line problem, because cache lines still contain unintentionally-cached dummies. In other words, using a virtual address prevents the dummies from being used, but cannot prevent them from being included in the cache lines. There seems to be no problem, because reads and writes do not directly take place on such unintentionally-cached dummies during program execution.


However, the moment a write to TC data occurs, the entire cache line containing the TC data is marked dirty, which means the dummies can also be written to memory during the write-back process, resulting in inconsistency. As a solution to this problem, we propose a method that takes a speculative approach: all data in the cache line containing TC data are cached, and the system ensures that these data are handled correctly when they are written back to memory. This is the other major topic of this section.

B. VIRTUAL ADDRESS MAPPING
Virtual address mapping requires three components: the TC-MMU, the TC heap, and the TC APIs. First, the TC-MMU is a hardware unit that translates a virtual address to a physical address. It is similar to an MMU for page table processing, but much simpler, since direct translation is its only function. Next, the system software prepares the TC heap at compile time, which is a memory space in the cacheable region that does not contain actual data. It has a start address and an end address, but it does not include any compiled binary; in this respect it is identical to the original heap used for dynamic memory allocation. Lastly, the TC API functions, tc_malloc and tc_free, are designed to perform the virtual address mapping internally. tc_malloc issues a virtual address when the physical address and size of the target variable are given. When a programmer calls this function with the two parameters, the function allocates the specified amount of memory in the TC heap and returns its address. At the same time, it also configures the TC-MMU with the original address and the new address by writing several registers. The description of these registers is given in Section IV-D, which presents a detailed description of the hardware for TC.

During the heap allocation, the issued address must be aligned to the size of the cache line. This is to prevent problems that may occur due to cache-line overlap between different TC data in the TC heap. Moreover, we allocate memory in such a way that the cache-line offset of the virtual address is the same as that of the original physical address. The cache-line offset refers to the value of the lower address bits below the cache-line size. This approach reduces the complexity of the address translation logic in the TC-MMU.

FIGURE 9 shows an example of assigning a new variable to the TC heap, where the start address and size of the original variable are 0x10001068 and 0x40, respectively, and the cache-line size is 0x20. As shown by the blue part in the figure, the start address of the new variable becomes 0x8 and its last address becomes 0x47, and the space allocated for the variable is larger than that, ranging from 0x0 to 0x60, as shown in green in the figure.

FIGURE 9. Example of assigning a new variable to the TC heap: the TC heap spans 0x0 to 0xFFF, the cache-line size is 0x20, and the start address and size of the original variable are 0x10001068 and 0x40, respectively. The green part is the space allocated for the new variable, and the blue part is the actual TC data.
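The allocation rule just described can be summarized by the following sketch, which reproduces the FIGURE 9 numbers (cache line 0x20, original variable at 0x10001068 with size 0x40). It is only an illustration of the address arithmetic, not the authors' implementation of tc_malloc.

    #include <stdint.h>
    #include <stddef.h>

    #define CACHE_LINE 0x20u                 /* cache-line size in the FIGURE 9 example */

    static uintptr_t tc_heap_ptr = 0x0;      /* next free virtual address in the TC heap */

    /* Issue a TC-heap address that is cache-line aligned and keeps the
     * cache-line offset of the original physical address, so the TC-MMU
     * only needs to add a constant offset. */
    static uintptr_t tc_heap_alloc(uintptr_t phys, size_t size)
    {
        uintptr_t offset = phys % CACHE_LINE;             /* 0x10001068 -> 0x8  */
        uintptr_t start  = tc_heap_ptr + offset;          /* new start: 0x8     */
        uintptr_t last   = start + size - 1;              /* last address: 0x47 */

        /* Reserve whole cache lines up to the one containing 'last':
         * the allocated space is 0x0 .. 0x60 in the example. */
        tc_heap_ptr = (last / CACHE_LINE + 1) * CACHE_LINE;
        return start;
    }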
When it is no longer possible to apply TC, the programmer executes tc_free to flush the TC data from the cache and to release the space allocated for the variable in the TC heap. Since other variables can later be allocated to the same address as the current TC data, tc_free must invalidate all the cached lines whether the cache policy is write-back or write-through. The corresponding TC-MMU registers are also reset, eliminating the mapping between the new variable and the original data in the memory map. tc_free also contains a garbage collection process, since the memory space for the TC heap is not infinite. The process can be implemented simply, without any complex algorithm, by tracking the number of alive TC variables in the two APIs; tc_malloc increments the number and tc_free decrements it. If the number becomes zero after the decrement, tc_free initializes the TC heap pointer, which is the variable used to assign the next virtual address, to the start address of the TC heap. This initialization almost always takes place at the end of a function, so tc_malloc can repeatedly issue virtual addresses.
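The garbage-collection rule can be captured in a few lines; the counter and the reset below are an illustrative sketch of the bookkeeping described above, not the actual library code.

    #include <stdint.h>

    #define TC_HEAP_START 0x0u                        /* assumed start address of the TC heap */

    static unsigned  alive_tc_vars = 0;               /* number of currently live TC variables */
    static uintptr_t tc_heap_ptr   = TC_HEAP_START;   /* next virtual address to issue */

    /* Called from tc_malloc after it issues a virtual address. */
    static void tc_heap_on_alloc(void) { alive_tc_vars++; }

    /* Called from tc_free after the cache flush and the TC-MMU reset: once no
     * TC variable is alive, the whole heap is free again, so simply rewind. */
    static void tc_heap_on_free(void)
    {
        if (alive_tc_vars > 0 && --alive_tc_vars == 0)
            tc_heap_ptr = TC_HEAP_START;
    }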
C. BYTE LEVEL MANAGEMENT
The basic idea for preventing the side effect of the unintentionally-cached dummies is as follows: we stick with the traditional approach in which cache data is moved on a per-cache-line basis, but when data from the cache is written back to main memory, the dummies must not be written back. FIGURE 10 shows an example of the proposed method. In the figure, the TC data of each core are displayed in blue and orange, respectively, and the gray areas indicate the unintentionally-cached dummies. As seen in the figure, when writing the data back to main memory, only the blue and orange areas, excluding the gray areas, should be written.

FIGURE 10. Proposed solution to the problem of unintentionally-cached dummies.

To realize this idea, we propose a byte-level management method that exploits the byte enable signal, which is used to determine the validity of data in the conventional bus protocol [27]. In the bus protocol, a data read or write is performed in units of the data bus width, which is the amount of data transferred per clock. If data smaller than the data bus width is transferred in a write operation, an unintended value may be written to memory. To prevent this, a byte enable bit is placed for each byte lane to indicate the validity of the data in that lane; therefore, as many byte enable bits are used as there are bytes in the data bus width. Meanwhile, in the case of a read operation on data smaller than the data bus width, no byte enable bit is needed, because there is no problem even if the unused part of the fetched bus-width data is simply ignored.

In the proposed byte-level management, only the byte enable bits of the part corresponding to the TC data are set to 1, and the rest are set to 0. For instance, in FIGURE 10, only the byte enable bits corresponding to the blue and orange data become 1, the rest become 0, and then only the part whose byte enable bits are 1 is written back to main memory. Along with the byte-level management mechanism, we have designed optimized hardware for it, and thanks to this hardware support, the system itself ensures that there is no data inconsistency due to the unintentionally-cached dummies, allowing the programmer to actively and conveniently use TC without worrying about the cache-line problem. A detailed description of the designed hardware is provided in the following section.
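Although the mechanism is realized in hardware (Section IV-D), the byte-level rule itself can be stated compactly in software form. The sketch below, assuming a 32-byte cache line, marks as valid only the bytes of a write-back that belong to the TC data range.

    #include <stdint.h>

    #define CACHE_LINE 32u   /* assumed cache-line size in bytes */

    /* For the cache line starting at 'line_base', enable only the bytes that
     * belong to the TC data range [tc_start, tc_last]; the remaining bytes are
     * the unintentionally-cached dummies and must not reach main memory. */
    static void build_byte_enable(uint8_t enable[CACHE_LINE], uintptr_t line_base,
                                  uintptr_t tc_start, uintptr_t tc_last)
    {
        for (unsigned i = 0; i < CACHE_LINE; i++) {
            uintptr_t byte_addr = line_base + i;
            enable[i] = (byte_addr >= tc_start && byte_addr <= tc_last) ? 1u : 0u;
        }
    }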


D. TEMPORARY CACHING UNIT
We have developed hardware that supports the virtual address mapping (i.e., the TC-MMU) and byte-level management functions, which we call the temporary caching unit (TCU). The TCU is written in the Verilog hardware description language at register-transfer level and is verified on a Xilinx FPGA. We first paid attention to the bus interface for communication between the core and the memory, and designed the TCU targeting the most commonly used AXI protocol [28], [29]. For reference, the AXI protocol consists of multiple channels, namely a read address (AR) channel, a write address (AW) channel, a read data (R) channel, a write data (W) channel, and a write response (B) channel, each of which works independently. Of these AXI channels, the TCU implements the virtual address mapping function by controlling the signals of the AR/AW channels, and the byte-level management function by controlling the signals of the W channel. The R and B channels of the AXI protocol are passed through without any control from the TCU. The proposed architecture of the TCU is illustrated in FIGURE 11; as shown in the figure, the TCU is largely composed of a block (on the left side of the figure) that takes the transfers of the AR/AW channels as input, and a block (on the right side of the figure) that takes the transfers of the W channel as input.

The block responsible for the virtual address mapping function of the TCU has dedicated hardware, called a TC entry, for mapping each variable to which TC is applied to its main memory address. The total number of TC entries in the block equals the number of variables to which TC can be applied at the same time, and platform developers can adjust this number as necessary. Each TC entry has two registers that store the start address and the last address of the variable received from the TC API, in order to determine whether the address accessed by the core through the AXI interface belongs to a variable to which TC is applied. The TC entry also has a register that stores the offset received from the TC API, which is used to convert the address of the TC variable to the address pointed to by the original variable. Meanwhile, unlike the TC entries themselves, the mux-based control logic for each TC entry has the same configuration and function for all entries, so we designed multiple TC entries to share one logic block to avoid unnecessary overhead.

The main operation of this block is as follows. When an input address enters the TC entry from the AR/AW channel, the Matched Decision Logic in the TC entry compares the start address and the last address stored in the registers with the input address, and generates a Matched signal that determines whether this address belongs to a variable to which TC is applied. At the same time, in the TC entry, the target address, indicating the address in main memory to which the TC variable is mapped, is calculated by adding the value of the offset register to the input address. That is, the target address in the TC entry is calculated regardless of whether the entry is matched, but the target address becomes the output of the TC entry only when the Matched signal is 1 (cf. the mux in the TC entry uses the Matched signal as its select signal); otherwise, the original input address is the output of the TC entry. Then, a bitwise OR operation is performed on the Matched signals of all the TC entries, and the result is called the TC signal. This TC signal is used as the select signal of the next mux, which determines the final output address.
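The address path just described can be summarized by a short behavioural model. The C code below mirrors the Matched Decision Logic, the offset adder, and the muxes of FIGURE 11; it is a sketch of the behaviour, not the Verilog implementation.

    #include <stdint.h>

    /* Register contents of one TC entry, written by the TC API. */
    struct tc_entry {
        uint32_t start;     /* start address of the TC variable        */
        uint32_t last;      /* last address of the TC variable         */
        uint32_t offset;    /* added to reach the original main memory */
    };

    /* One AR/AW address passes through all TC entries: an entry raises its
     * Matched signal when the address falls in its range, the OR of all
     * Matched signals is the TC signal, and the matching entry's translated
     * address is selected; otherwise the address is passed through. */
    static uint32_t tcu_translate(const struct tc_entry *e, int num_entries,
                                  uint32_t in_addr, int *tc_signal)
    {
        uint32_t out_addr = in_addr;
        *tc_signal = 0;

        for (int i = 0; i < num_entries; i++) {
            int matched = (in_addr >= e[i].start) && (in_addr <= e[i].last);
            if (matched) {
                out_addr   = in_addr + e[i].offset;   /* target address */
                *tc_signal = 1;
            }
        }
        return out_addr;
    }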
The right-side block of the TCU in FIGURE 11 implements the byte-level management function of the TCU. This block consists of a FIFO that stores the information received from the block on the left, logic that creates the burst addresses of TC data (we call this the Burst Address Generator), and logic that controls the byte enable signal, which in the AXI protocol is called the WSTRB signal (we call this logic the WSTRB Mask Generator). The Burst Address Generator receives the TC signal and the AW channel information from the FIFO as inputs (cf. FIFO Out1 in the figure) and generates the burst address as output. Along with this burst address, the WSTRB Mask Generator takes as input the TC signal, the matched start addresses, the matched last addresses, and the Matched signals from the FIFO (cf. FIFO Out2 in the figure), and outputs a WSTRB mask for byte-level management.

The main operation of this block is as follows. According to the write operation of the AXI protocol, the write information generated by the AW channel (cf. AW channel info. in FIGURE 11) is stored in the FIFO of this block, and data then enter this block from the W channel once or several times, in units of a certain size, depending on the data transfer mode determined by the AW channel information. When all addresses of the data correspond to TC data, i.e., when the TC signal is 1, the Burst Address Generator calculates the burst address of the corresponding data for each data transfer by using the AW channel information from the FIFO. Simultaneously, when the TC signal is 1, the WSTRB Mask Generator compares the burst address with the matched start address and matched last address to determine, at byte level, whether the transmitted data is TC data. The WSTRB mask signal is then generated by setting the bits of the WSTRB mask corresponding to TC data to 1 and the other bits to 0. On the other hand, when the TC signal is 0, meaning that the data is non-TC data and its address belongs to the non-cacheable region, the WSTRB Mask Generator instantly sets all bits of the WSTRB mask to 1.


Finally, a bitwise AND operation is performed between the generated WSTRB mask signal and the input WSTRB signal, and the converted output WSTRB signal is sent out from the TCU. Owing to the converted output WSTRB signal, only the TC data, excluding the invalid portion of the data, is written to main memory, as shown in FIGURE 10, so the cache-line problem due to the unintentionally-cached data does not occur.

FIGURE 11. The proposed TCU architecture.

V. EXPANSION OF TC CAPABILITY
A. EMBEDDING THE TCU INTO NETWORK-ON-CHIP
In order to answer the question of where it is best to implement the developed TCU in a multicore platform, we focused on the NoC, which plays a pivotal role in concurrent communication between IPs in the platform. Owing to the ability of the NoC to overcome the limitations of conventional bus-based system interconnects (e.g., the unbearably increasing density and complexity induced by the system interconnect) [27], [30]–[32], the NoC is commonly used in state-of-the-art multicore platforms. FIGURE 12 (a) shows the conventional NoC architecture, in which a processor core in the platform communicates with other IPs only through its dedicated network interface (NI) of the NoC [28], [33]. Therefore, since the developed TCU operates independently between the core and the network, if it is embedded in the NI, TC can be realized on the platform no matter what cores are used. In addition, as shown in FIGURE 12 (b), placing the TCU inside the NoC does not require modification of the original internal structure of the NI, so adding a TCU to the NI can be done easily without being limited to a specific NoC. In the end, we propose embedding the TCU in the core-dedicated NI within the NoC as a general solution for TCU implementation.

FIGURE 12. Architectures of (a) the conventional NoC and (b) the proposed NoC with the embedded TCU.

In this paper, we implemented the TCU in our own NoC based on the architecture presented in [27], which is a compactly designed NoC that supports various types of IP interface conversion and has been silicon-proven in a fabricated SoC. To embed the TCU in the NoC, we first designed the TCU to have an advanced peripheral bus (APB) interface to configure the start address register, last address register, and offset register in the TCU. This APB interface is connected to the NoC as shown in FIGURE 12 (b), so that the core can control the TCU using simple memory read/write operations. Next, we placed the TCU between the core and the existing NI, so that AXI data between the core and the NI must be processed through the TCU.
end, we propose to embed the TCU in the core-dedicated be processed through the TCU.


B. ENABLING DESIGN AUTOMATION OF RISC-V MULTICORE PLATFORMS WITH THE TCU
In a previously published paper [34], we introduced a new electronic design automation (EDA) tool, RISC-V eXpress (RVX), that allows SoC developers to quickly and easily create SoC platforms using a variety of RISC-V cores. Indeed, there are many open-source RISC-V cores and it is not difficult to acquire them, but the process of developing SoCs using such open-source cores is very complex and requires a lot of time and effort as well as strong design skills and experience. To tackle this and ultimately accelerate SoC development, RVX generates Verilog RTL code, an FPGA prototype, and a software development kit (SDK) for the target SoC when the IPs to be integrated into the SoC are given using a high-level description.

In this paper, we have integrated the proposed TCU into RVX, so that RISC-V-based multicore SoCs equipped with TCUs can be automatically generated through RVX. More specifically, we have implemented the TCU-embedded NoC in RVX, allowing users to select this NoC as the on-chip communication IP in the target SoC platform that connects the selected RISC-V cores and the various other necessary IPs. We have also upgraded RVX to support an interface that enables users to specify the number of TC entries per TCU on the target platform. Finally, by using the upgraded RVX, we prototyped a TCU-embedded RISC-V multicore platform and set up a software development environment. A detailed description of the experimental work performed using this platform is presented in the next section.

VI. EXPERIMENTAL WORK
A. PROTOTYPING A RISC-V MULTICORE PLATFORM
To verify the function and effectiveness of the proposed TC, we have implemented a complete verification system including the TC-enabled multicore platform. In particular, the prototype platform was designed as a quad-core; for this, four Rocket [35] cores based on the RISC-V ISA were implemented on the platform, each of which was created as a single core without CCL. Additionally, this platform has a 512 MB DDR memory, video input/output controllers (VIC/VOC), and peripherals including UART, I2C, etc. Finally, all the IPs are interconnected with the developed TCU-embedded NoC; the NoC has four NIs, one for each core, and each NI has a TCU with eight TC entries. The clocks of the IPs are summarized in TABLE 2.

TABLE 2. Clock speed of each IP on the prototype multicore platform (MHz).

To utilize video input/output, we designed a custom FPGA board. It contains a Xilinx FPGA chip (Virtex UltraScale+), DDR4 memories, a camera, and an LCD screen. FIGURE 13 shows the architecture of the developed platform, along with a picture of the actual implementation prototyped on the custom board.

FIGURE 13. TC-enabled multicore platform.

The platform prototype was synthesized using Xilinx Vivado [36], and the resulting resource consumption of the TCU and the other components is reported in TABLE 3. The four TCUs consume 4,424 look-up tables (LUTs) and 3,644 flip-flops (FFs), which account for only 3.9% and 3.1% of the entire platform, respectively.

TABLE 3. Resource consumption on the FPGA.

B. DEVELOPING AN APPLICATION
We developed a camera-input-based handwriting recognition application to verify the validity of TC. In fact, handwriting recognition applications are commonly used as a basic example of deep neural networks (DNNs) [37], [38]. This basic handwriting recognition application recognizes an image in which one of the numbers 0 to 9 is handwritten, determines which number the image shows, and displays the result. As the DNN architecture for this application, an architecture consisting of two convolution layers, two max pooling layers, and two fully connected layers was used, as shown in FIGURE 14. For DNN training, MNIST [39], the well-known handwritten image database, was used, and the parameters of the DNN, trained on a Linux PC using TensorFlow [40], a deep learning framework, were applied to the handwriting recognition application.


FIGURE 14. Multi-processing of the handwriting recognition application.

The MNIST database is composed of images of handwritten numbers; each image is an 8-bit grayscale image of 28 × 28 pixels, and the handwritten number in each image is located at a certain size in the center of the image. In addition, since the MNIST database was used for training the DNN, the input image of the DNN must be in the same format as the images of the MNIST database, and, to improve the performance of DNN inference, the handwritten number should be located in the center of the image at a certain size, as in the MNIST images.

In our target application, which uses images taken by the camera connected to the multicore prototype, not only is the format of the image obtained from the camera different from that of the MNIST database, but the handwritten number may not be located in the center of the image. Therefore, to convert the image received from the camera into the image format of the MNIST database and to place the number in the center of the image, we needed to implement image pre-processing steps such as contrast adjustment, bounding box detection, and resizing (cf. the yellow boxes in FIGURE 14). Finally, including this image pre-processing part, we implemented the camera-input-based handwriting recognition application by coding all of the inference parts of the DNN in the C language.

Using the developed handwriting recognition application as a baseline, and in order to measure how much the performance improves when TC is applied, we wrote a testbench that applies TC to the baseline code and processes the data in parallel. More specifically, by coding the testbench to apply TC to the image pre-processing stage, we made the pre-processing run in parallel on the four cores. After the pre-processing stage, the convolution layers and the max pooling layers of the application are distributed to the four cores in units of feature maps, and the multilayer perceptron (MLP) is likewise distributed and processed in parallel.
processed in parallel through implementation. D. COMPARISON WITH THE OTHER APPROACHES
The developed testbench and baseline application were run We evaluate how close the performance improvement of the
on a multicore platform consisting of four RISC-V cores proposed TC is to that of a multi-core platform using CCL.


TABLE 4. Comparison of quad-core platforms with different coherency schemes implemented on the FPGA.

To this end, we used the Rocket [35] cores, which fortunately also offer a 4-core version with a dedicated CCL along with the single-core version. The results of the comparative analysis of the hardware-based approach using CCL, the software-based approach that does not allow caching of shared variables at all, and the proposed TC are reported in TABLE 4 as HW, practical-SW, and TC, respectively. As can be seen from the table, the HW approach often presents a very high level of difficulty in developing the CCL directly, so practical platform development is limited to the few types of cores that come with a CCL. On the other hand, the practical-SW and TC approaches present low platform development challenges, no matter which core is used to develop a multicore platform. Looking at the difficulty of developing applications that run on the developed platform, the TC approach provides an easy-to-use API, but it is still more demanding than the HW or practical-SW approach. Meanwhile, in FPGA prototyping, the hardware resource consumption results of the CCL and the TCU show that the HW approach requires more hardware resources than the TC approach. In addition, for the performance comparison, the HW approach has the shortest application execution time, as expected, but the TC approach is close behind. Of course, the performance of these two is far better than that of the practical-SW approach. In the end, when developing multicore platforms using different types of RISC-V cores, the proposed TC approach may be the general solution, as it is flexible and easy to use, and the developed platform shows good performance.

VII. CONCLUSION
When developing multicore platforms using various RISC-V cores, it is difficult to implement a dedicated CCL within each platform, which results in a serious performance degradation problem. As a solution to this problem, we proposed TC, a method that improves the performance of a multicore platform by enabling caching of data that are certain not to be shared for a certain period of time. Through a sophisticated operation mechanism, the proposed TC achieves a performance improvement for the multicore platform while preventing the problem that can occur when TC data and non-TC data are on the same cache line, which can cause a fatal system error. To implement the proposed TC, we developed the TC API for programmers and the TC-dedicated hardware, the TCU, for platform developers, and detailed descriptions of each implementation were provided in this paper. In particular, since the TCU operates independently of the core, it is possible to develop a TC-enabled multicore platform no matter what RISC-V cores are used. In addition, we proposed a method of embedding and implementing the TCU in the NoC of a multicore platform to facilitate the convenience of platform developers. Finally, in order to verify the effectiveness of the proposed TC, we implemented a quad-core platform equipped with TCUs on an FPGA and developed a handwriting recognition application with TC applied as a testbench. Through experimental work, we demonstrated that by applying TC, the performance of a multicore platform can be improved by up to about 37% compared to the performance of a platform without TC.

ACKNOWLEDGMENT
(Hyeonguk Jang and Kyuseung Han contributed equally to this work.)

REFERENCES
[1] J. L. Hennessy and D. A. Patterson, "A new golden age for computer architecture," Commun. ACM, vol. 62, no. 2, pp. 48–60, Jan. 2019.
[2] RISC-V. Accessed: Feb. 23, 2020. [Online]. Available: https://riscv.org/
[3] S. Greengard, "Will RISC-V revolutionize computing?" Commun. ACM, vol. 63, no. 5, pp. 30–32, Apr. 2020.
[4] D. Patterson, "50 years of computer architecture: From the mainframe CPU to the domain-specific TPU and the open RISC-V instruction set," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 27–31.
[5] B. Chapman, L. Huang, E. Biscondi, E. Stotzer, A. Shrivastava, and A. Gatherer, "Implementing OpenMP on a high performance embedded multicore MPSoC," in Proc. IEEE Int. Symp. Parallel Distrib. Process., May 2009, pp. 1–8.
[6] A. C. Sodan, J. Machina, A. Deshmeh, K. Macnaughton, and B. Esbaugh, "Parallelism via multithreaded and multicore CPUs," Computer, vol. 43, no. 3, pp. 24–32, Mar. 2010.
[7] Y. Kanehagi, D. Umeda, A. Hayashi, K. Kimura, and H. Kasahara, "Parallelization of automotive engine control software on embedded multi-core processor using OSCAR compiler," in Proc. 16th IEEE COOL Chips, Apr. 2013, pp. 1–3.
[8] S. Davidson, S. Xie, C. Torng, K. Al-Hawai, A. Rovinski, T. Ajayi, L. Vega, C. Zhao, R. Zhao, S. Dai, A. Amarnath, B. Veluri, P. Gao, A. Rao, G. Liu, R. K. Gupta, Z. Zhang, R. Dreslinski, C. Batten, and M. B. Taylor, "The celerity open-source 511-core RISC-V tiered accelerator fabric: Fast architectures and design methodologies for fast chips," IEEE Micro, vol. 38, no. 2, pp. 30–41, Mar./Apr. 2018.
[9] M. Strobel and M. Radetzki, "Design-time memory subsystem optimization for low-power multi-core embedded systems," in Proc. IEEE 13th Int. Symp. Embedded Multicore/Many-Core Syst.-Chip (MCSoC), Oct. 2019, pp. 347–353.


[10] M. Wang, T. Ta, L. Cheng, and C. Batten, "Efficiently supporting dynamic task parallelism on heterogeneous cache-coherent systems," in Proc. ACM/IEEE 47th Annu. Int. Symp. Comput. Archit. (ISCA), May 2020, pp. 173–186.
[11] M. Ferdman, P. Lotfi-Kamran, K. Balet, and B. Falsafi, "Cuckoo directory: A scalable directory for many-core systems," in Proc. IEEE 17th Int. Symp. High Perform. Comput. Archit., Feb. 2011, pp. 169–180.
[12] M. M. K. Martin, M. D. Hill, and D. J. Sorin, "Why on-chip cache coherence is here to stay," Commun. ACM, vol. 55, no. 7, pp. 78–89, Jul. 2012.
[13] Y. Fu, T. M. Nguyen, and D. Wentzlaff, "Coherence domain restriction on large scale systems," in Proc. 48th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), New York, NY, USA, Dec. 2015, pp. 686–698.
[14] H. Kim, A. Kandhalu, and R. Rajkumar, "A coordinated approach for practical OS-level cache management in multi-core real-time systems," in Proc. 25th Euromicro Conf. Real-Time Syst., Jul. 2013, pp. 80–89.
[15] M. Hassan, A. M. Kaushik, and H. Patel, "Predictable cache coherence for multi-core real-time systems," in Proc. IEEE Real-Time Embedded Technol. Appl. Symp. (RTAS), Apr. 2017, pp. 235–246.
[16] S. Li and D. Guo, "Cache coherence scheme for HCS-based CMP and its system reliability analysis," IEEE Access, vol. 5, pp. 7205–7215, 2017.
[17] M. Gupta, V. Sridharan, D. Roberts, A. Prodromou, A. Venkat, D. Tullsen, and R. Gupta, "Reliability-aware data placement for heterogeneous memory architecture," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2018, pp. 583–595.
[18] A. Ros, M. E. Acacio, and J. M. Garcia, "DiCo-CMP: Efficient cache coherency in tiled CMP architectures," in Proc. IEEE Int. Symp. Parallel Distrib. Process., Apr. 2008, pp. 1–11.
[19] I.-C. Lin and J.-N. Chiou, "High-endurance hybrid cache design in CMP architecture with cache partitioning and access-aware policies," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 10, pp. 2149–2161, Oct. 2015.
[20] U. Milic, A. Rico, P. Carpenter, and A. Ramirez, "Sharing the instruction cache among lean cores on an asymmetric CMP for HPC applications," in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw. (ISPASS), Apr. 2017, pp. 3–12.
[21] G. G. Shahidi, "Chip power scaling in recent CMOS technology nodes," IEEE Access, vol. 7, pp. 851–856, 2019.
[22] M. Ansari, M. Pasandideh, J. Saber-Latibari, and A. Ejlali, "Meeting thermal safe power in fault-tolerant heterogeneous embedded systems," IEEE Embedded Syst. Lett., vol. 12, no. 1, pp. 29–32, Mar. 2020.
[23] S. Chakraborty and H. K. Kapoor, "Exploring the role of large centralised caches in thermal efficient chip design," ACM Trans. Design Autom. Electron. Syst., vol. 24, no. 5, pp. 1–28, Oct. 2019.
[24] M. Rapp, M. Sagi, A. Pathania, A. Herkersdorf, and J. Henkel, "Power- and cache-aware task mapping with dynamic power budgeting for many-cores," IEEE Trans. Comput., vol. 69, no. 1, pp. 1–13, Jan. 2020.
[25] B. Choi, R. Komuravelli, H. Sung, R. Smolinski, N. Honarmand, S. V. Adve, V. S. Adve, N. P. Carter, and C.-T. Chou, "DeNovo: Rethinking the memory hierarchy for disciplined parallelism," in Proc. Int. Conf. Parallel Archit. Compilation Techn., Oct. 2011, pp. 155–166.
[26] J. Cai and A. Shrivastava, "Software coherence management on non-coherent cache multi-cores," in Proc. 29th Int. Conf. VLSI Design, 15th Int. Conf. Embedded Syst. (VLSID), Jan. 2016, pp. 397–402.
[27] K. Han, J.-J. Lee, and W. Lee, "Converting interfaces on application-specific network-on-chip," J. Semicond. Technol. Sci., vol. 17, no. 4, pp. 505–513, Aug. 2017.
[28] H. Jang, K. Han, S. Lee, J.-J. Lee, and W. Lee, "MMNoC: Embedding memory management units into network-on-chip for lightweight embedded systems," IEEE Access, vol. 7, pp. 80011–80019, 2019.
[29] D. Petrisko, F. Gilani, M. Wyse, D. C. Jung, S. Davidson, P. Gao, C. Zhao, Z. Azad, S. Canakci, B. Veluri, T. Guarino, A. Joshi, M. Oskin, and M. B. Taylor, "BlackParrot: An agile open-source RISC-V multicore for accelerator SoCs," IEEE Micro, vol. 40, no. 4, pp. 93–102, Jul. 2020.
[30] L. Chen, D. Zhu, M. Pedram, and T. M. Pinkston, "Power punch: Towards non-blocking power-gating of NoC routers," in Proc. HPCA, 2015, pp. 378–389.
[31] K. Han, J.-J. Lee, J. Lee, W. Lee, and M. Pedram, "TEI-NoC: Optimizing ultralow power NoCs exploiting the temperature effect inversion," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 2, pp. 458–471, Feb. 2018.
[32] K. Han, S. Lee, J.-J. Lee, W. Lee, and M. Pedram, "TIP: A temperature effect inversion-aware ultra-low power system-on-chip platform," in Proc. IEEE/ACM Int. Symp. Low Power Electron. Design (ISLPED), Jul. 2019, pp. 1–6.
[33] M. Schoeberl, L. Pezzarossa, and J. Sparsø, "A minimal network interface for a simple network-on-chip," in Architecture of Computing Systems. Cham, Switzerland: Springer, 2019, pp. 295–307.
[34] K. Han, S. Lee, K.-I. Oh, Y. Bae, H. Jang, J.-J. Lee, W. Lee, and M. Pedram, "Developing TEI-aware ultralow-power SoC platforms for IoT end nodes," IEEE Internet Things J., vol. 8, no. 6, pp. 4642–4656, Mar. 2021.
[35] K. Asanović et al., "The rocket chip generator," Dept. EECS, Univ. California, Berkeley, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2016-17, Apr. 2016. [Online]. Available: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-17.html
[36] Xilinx. Vivado 2016.4. Accessed: Feb. 23, 2020. [Online]. Available: https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/vivado-design-tools/2016-4.html
[37] D. Ciresan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3642–3649.
[38] B. Hutchinson, L. Deng, and D. Yu, "Tensor deep stacking networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1944–1957, Aug. 2013.
[39] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[40] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proc. 12th USENIX Symp. Oper. Syst. Design Implement. (OSDI), Savannah, GA, USA, Nov. 2016, pp. 265–283. [Online]. Available: https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi

HYEONGUK JANG received the B.S. and M.S. degrees in electrical engineering from Gyeongsang National University, Jinju, South Korea, in 2013 and 2015, respectively. He is currently pursuing the Ph.D. degree with the University of Science and Technology. He has been with the SoC Design Research Group, Electronics and Telecommunications Research Institute. His research interests include network-on-chip and system software in embedded systems.

KYUSEUNG HAN (Member, IEEE) received the B.S. and Ph.D. degrees in electrical engineering and computer science from Seoul National University (SNU), Seoul, South Korea, in 2008 and 2013, respectively. At SNU, he researched on computer architecture and design automation. Since 2014, he has been working with the Electronics and Telecommunications Research Institute (ETRI), Daejeon, South Korea. He currently belongs to the SoC Design Research Group as a Senior Researcher. His current research interests include reconfigurable architecture, network-on-chip, and ultra-low-power techniques in embedded systems.

SUKHO LEE received the Ph.D. degree in information communications engineering from Chungnam National University, Daejeon, South Korea, in 2010. He is currently a Principal Researcher with the SoC Design Research Group, Electronics and Telecommunications Research Institute, Daejeon. His current research interests include ultra-low-power system-on-chip design, embedded system design, video codec design, and video image processing.
JAE-JIN LEE received the B.S., M.S., and Ph.D. degrees in computer engineering from Chungbuk National University, in 2000, 2003, and 2007, respectively. He is currently a Group Leader with the SoC Design Research Group, Electronics and Telecommunications Research Institute, and also a Professor with the Department of ICT, University of Science and Technology. His research interests include processor and compiler designs in ultra-low-power embedded systems.

SEUNG-YEONG LEE (Student Member, IEEE) received the B.S. degree from Chung-Ang University, Seoul, South Korea, in 2020, where he is currently pursuing the M.S. degree in electrical and electronics engineering. He is currently a Beneficiary Student of the High-Potential Individuals Global Training Program. His research interests include low-power design, SoC architecture, and embedded systems.

JAE-HYOUNG LEE (Student Member, IEEE) received the B.S. degree from Myoungji University, Yong-In, South Korea, in 2020. He is currently pursuing the M.S. degree in electrical and electronics engineering with Chung-Ang University. He is currently a Beneficiary Student of the High-Potential Individuals Global Training Program. His research interests include low-power design, SoC architecture, and embedded systems.

WOOJOO LEE (Member, IEEE) received the B.S. degree in electrical engineering from Seoul National University, Seoul, South Korea, in 2007, and the M.S. and Ph.D. degrees in electrical engineering from the University of Southern California, Los Angeles, CA, USA, in 2010 and 2015, respectively. He was with the Electronics and Telecommunications Research Institute, from 2015 to 2016, as a Senior Researcher with the SoC Design Research Group, and with the Department of Electrical Engineering, Myongji University, from 2017 to 2018, as an Assistant Professor. He is currently an Assistant Professor with the School of Electrical and Electronics Engineering, Chung-Ang University, Seoul. His research interests include ultra-low-power VLSI and SoC designs, embedded system designs, and system-level power and thermal management.