Developing A Multicore Platform Utilizing Open RISC-V Cores
ABSTRACT RISC-V has experienced explosive growth since its first appearance in 2011. Dozens of free and open cores based on this instruction set architecture have been released, and RISC-V based devices optimized for specific applications such as IoT and wearables, embedded systems, AI, and virtual and augmented reality are emerging. As RISC-V cores are adopted in these various fields, the demand for multicore platforms composed of RISC-V cores is also rapidly increasing. Although a wide range of RISC-V cores has been developed for specific applications, and it seems possible to pick from among them to create the multicore best optimized for a target application, this is very difficult to realize in practice. The main reason is that most open cores are released as single cores without cache coherence logic, and adding such logic demands expensive design effort and development costs. To tackle this issue, this paper proposes a method that solves the cache coherence problem without additional effort from the developer and maximizes the performance of a multicore composed of the RISC-V cores selected by the developer. Along with a description of the sophisticated operating mechanism of the proposed method, this paper details its architecture and hardware implementation. Experiments conducted on a prototype RISC-V multicore platform incorporating the proposed architecture, together with an application developed to run on the platform, demonstrate the effectiveness of the proposed method.
INDEX TERMS Multicore platform, RISC-V, system-on-chip (SoC), electronic design automation (EDA).
the open RISC-V cores, especially those released as single cores, unfortunately poses enormous challenges. This is mainly due to the cache coherence problem. Without solving this problem, the correct operation of the multicore is not guaranteed, or the expected performance is not achieved. The currently available method of solving it is for platform developers to implement the cache coherence logic (CCL) for each multicore themselves, but this has a critical limitation: the design effort and development costs are very expensive and sometimes prohibitive.

To address the difficulty of developing multicore platforms utilizing RISC-V cores, we propose a method that solves the cache coherence problem without CCL and maximizes the performance of the multicore regardless of which RISC-V cores are used in the platform. The main idea of the proposed method is to avoid the cache coherence problem by disallowing caching of shared data by default, and to temporarily allow caching of data that are obviously not shared for a period of time, in order to compensate for the performance degradation caused by the inability to use the cache. We leave it to the programmer to determine which data are temporarily cached, but provide simple application programming interface (API) functions to make it easy to apply this method when developing applications. In addition, we analyze and identify problems that may arise when the proposed method is applied to existing cache structures, and devise sophisticated behavior mechanisms to address them. Next, we develop the architecture that realizes the proposed method in a RISC-V core-based multicore platform, and implement the necessary hardware. We then build the proposed architecture into a network-on-chip (NoC) responsible for IP-to-IP communication in a system-on-chip (SoC), so that the proposed method can be applied no matter which RISC-V cores the developer selects for the multicore. Moreover, by including the proposed architecture in a RISC-V-based SoC automatic design tool, we increase the usability of the proposed method. Finally, to verify its effectiveness, we implement a RISC-V multicore prototype platform including the proposed architecture on an FPGA, and develop a camera input-based handwriting recognition program as an application. Through experimental work based on the application running on the FPGA, we confirm that the platform with the proposed method applied shows a performance improvement of about 37% over the platform without it.

FIGURE 2. Solutions to the cache coherence problem: (a) simply non-caching the shared data and (b) using CCL.

The main contributions of this paper may be summarized as follows:
– As the most practical solution to the cache coherence problem in multicore development with RISC-V cores, a temporary caching (TC) method is presented.
– Along with the sophisticated operation mechanism of the proposed TC, a detailed description of the hardware and software development for TC is provided.
– Through prototyping of a RISC-V-based multicore platform to which TC is applied and the development of an application running on this platform, the effectiveness of the proposed solution is verified.

The remainder of this paper is organized as follows. Section II elucidates the cache coherence problem that can occur when a multicore is configured from RISC-V cores, and the existing solutions associated with it. Section III introduces the main idea of the proposed method and discusses problems that may arise from it. Section IV presents a detailed description of the architecture and hardware implementation of the proposed method. Implementing the proposed architecture on a NoC and making it automatically designable are covered in Section V. Section VI develops a prototype RISC-V multicore platform and applications as test benches and provides experimental results obtained from them. Finally, Section VII concludes the paper.

II. CACHE COHERENCY PROBLEM IN RISC-V MULTICORE
The cache coherency problem is a well-known problem on multicores, arising because caches are distributed across individual cores. Since each core has its own cache, the copy of shared data in that cache may not always be the most up-to-date version, resulting in data synchronization failures that can crash the program or the entire computer. FIGURE 1 shows a simple example of the cache coherency problem. In the figure, there is a dual-core processor with Core1 and Core2, where each core has brought a memory block for the variable A into its private cache. Core2 then writes 0 to A. When Core1 subsequently reads A from its cache, it will not have the latest version, producing incorrect results.
FIGURE 4. Memory access over time on Core1 for array A, when the TC is (a) not applied and (b) applied.

they support CCL, but these RISC-V cores have a major limit on scalability because they come with a fixed number of cores. Furthermore, platform developers may want to use existing RISC-V cores to construct heterogeneous multicores, and even in these cases they still face the problem of having to implement their own CCL. After all, platform developers want to choose the most suitable RISC-V cores from the available list, but the reality is that they will have a hard time building a multicore platform at scale regardless of the cache coherence type.

FIGURE 5. Programming example.
the program execution. However, the moment a write to TC data occurs, the entire cache line containing the TC data is changed to dirty, which means dummies can also be written to memory during the write-back process, resulting in inconsistency. As a solution to this problem, we propose a method that takes a speculative approach: it caches all data in the cache line containing TC data and ensures that these data behave correctly when they are written back to memory. This is the other major topic of this section.

B. VIRTUAL ADDRESS MAPPING
Virtual address mapping requires three components: the TC-MMU, the TC heap, and the TC APIs. First, the TC-MMU is a hardware unit that translates virtual addresses to physical addresses. It is similar to an MMU for page table processing, but much simpler, since direct translation is its only function. Next, the system software prepares the TC heap at compile time, which is a memory space in the cacheable region but does not contain actual data. It has a start address and an end address but, like the original heap used for dynamic memory allocation, is not included in the compiled binary. Lastly, the TC API functions, tc_malloc and tc_free, are designed to perform the virtual address mapping internally. tc_malloc issues a virtual address when given the physical address and the size of the target variable. When a programmer calls this function with the two parameters, the function allocates the specified amount of memory in the TC heap and returns its address. At the same time, it also configures the TC-MMU with the original address and the new address by writing several registers. The description of these registers is given in Section IV-D, which presents a detailed description of the hardware for TC.

During the heap allocation, the issued address must be aligned to the size of the cache line. This prevents problems that may occur due to cache-line overlap between different TC data in the TC heap. Moreover, we allocate memory in such a way that the cache-line offset of the virtual address is the same as that of the original physical address. The cache-line offset refers to the value of the lower address bits below the size of the cache line. This approach reduces the complexity of the address translation logic in the TC-MMU.

FIGURE 9 shows an example of assigning a new variable to the TC heap, where the start address and the size of the variable are 0x10001068 and 0x40, respectively, and the size of the cache line is 0x20. As shown by the blue part in this figure, the start offset of the new variable becomes 0x8 and its last offset becomes 0x47, while the space allocated to the variable is larger than that, ranging from 0x0 to 0x60 as shown in green in the figure.

When TC can no longer be applied, the programmer calls tc_free to flush the TC data from the cache and to release the space allocated to the variable in the TC heap. Since other variables can later be allocated to the same addresses as the current TC data, tc_free must invalidate all the cached lines whether the cache policy is write-back or write-through. The corresponding TC-MMU registers are also initialized, eliminating the mapping between the new variable and the original data in the memory map. tc_free also contains a garbage collection process, since the memory space for the TC heap is not infinite. The process can be implemented simply, without any complex algorithm, by tracking the number of alive TC variables in the two APIs: tc_malloc increments the count and tc_free decrements it. If the count becomes zero after the decrement, tc_free resets the TC heap pointer, which is the variable used to assign the next virtual address, to the start address of the TC heap. This initialization almost always takes place at the end of a function, so tc_malloc can repeatedly reissue virtual addresses.
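The allocation policy just described (cache-line alignment, matching cache-line offsets, and counter-based garbage collection) can be summarized in a few lines of C. The sketch below is our reading of the mechanism, not the paper's actual implementation; the TC heap base, the alive counter, and the two platform hooks are assumptions introduced for illustration.

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE    0x20u        /* cache-line size from the FIGURE 9 example */
#define TC_HEAP_START 0x10000000u  /* hypothetical start address of the TC heap */

static uintptr_t tc_heap_ptr = TC_HEAP_START; /* next virtual address to hand out */
static int       tc_alive    = 0;             /* live TC variables (for GC)       */

/* Platform hooks (stubs here): on real hardware these would program the
 * TC-MMU mapping registers and flush/invalidate the cached TC lines.    */
static void configure_tc_mmu(uintptr_t p, uintptr_t v, size_t s) { (void)p; (void)v; (void)s; }
static void clear_tc_mmu_and_invalidate(void *v) { (void)v; }

void *tc_malloc(uintptr_t phys, size_t size)
{
    /* Start from the next cache-line-aligned slot in the TC heap. */
    uintptr_t base = (tc_heap_ptr + CACHE_LINE - 1) & ~(uintptr_t)(CACHE_LINE - 1);

    /* Keep the cache-line offset of the virtual address identical to the
     * physical one, simplifying the TC-MMU translation logic.            */
    uintptr_t virt = base + (phys & (CACHE_LINE - 1));

    /* Reserve whole cache lines so distinct TC variables never share one. */
    tc_heap_ptr = (virt + size + CACHE_LINE - 1) & ~(uintptr_t)(CACHE_LINE - 1);

    configure_tc_mmu(phys, virt, size);
    tc_alive++;
    return (void *)virt;
}

void tc_free(void *virt)
{
    /* Flush/invalidate the cached TC lines and drop the TC-MMU mapping. */
    clear_tc_mmu_and_invalidate(virt);

    /* Counter-based garbage collection: once no TC variable is alive,
     * rewind the heap pointer so virtual addresses can be reissued.     */
    if (--tc_alive == 0)
        tc_heap_ptr = TC_HEAP_START;
}
```

With the FIGURE 9 parameters (physical address 0x10001068, size 0x40, cache line 0x20), this sketch reserves 0x60 bytes of the heap and returns an address at offset 0x8 within them, matching the figure.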
description of the hardware for TC. in units of data bus width, which is the size of data transferred
During the heap allocation, the issued address must be per clock. If data smaller than the data bus width is transferred
aligned to the size of the cache line. This is to prevent for the write operation, an unintended value may be written
problems that may occur due to cache-line overlap between to the memory. To prevent this, a byte enable bit is placed for
different TC data in the TC heap. Moreover, we allocate each byte in a byte lane to determine the validity of the data.
memory in the way that the cache-line-offset of the virtual Therefore, as many byte enable bits are used as the number of
address is the same as that of the original physical one. The bytes of the data bus width. Meanwhile, in the case of a read
cache-line-offset refers to a value of the lower bits that are operation, when reading data smaller than the data bus width,
smaller than the size of the cache-line among all address no byte enable bit is needed because there is no problem even
bits. This approach will reduce the complexity of address if the data is not used except for the required part by taking
translation logics in TC-MMU. the data as much as the data bus width.
FIGURE 9 shows an example of assigning a new variable In the proposed byte level management, only the byte
to the TC heap, where the start address and the size of the enable bits of the part corresponding to the TC data are set
variable are 0 × 10001068 and 0 × 40, respectively, and the to 1, and the rest are set to 0. For instance, in FIGURE 10,
size of the cache-line is 0 × 20. As shown by the blue part in only the byte enable bit corresponding to the blue and orange
this figure, the start address of the new variable becomes 0×8 data becomes 1, the rest becomes 0, and then only the
and the last address becomes 0 × 47, and the space allocated part with the corresponding byte enable bits 1 is written
to the variable is larger than that, which is from 0×0 to 0×60 back to main memory. Along with the byte-level manage-
as shown in green in the figure. ment mechanism, we have designed optimized hardware
When it is no longer possible to apply TC, the programmer for this, and thanks to this hardware support, the system
executes the tc_free to flush the TC data in the cache itself ensures that there are no data inconsistency due to the
and to release the allocated space for the variable in the unintentionally-cached dummies, allowing the programmer
TC heap. Since other variables can be allocated to the same to actively and conveniently use TC without worrying about
address of current TC data in the future, tc_free must the cache-line problem. A detailed description of the designed
invalidate all the caches whether cache policy is write-back hardware is provided in the following section.
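As a concrete illustration of the byte enable computation, the following C sketch derives the enable mask for one cache line during write-back: only bytes that fall inside a registered TC region are marked valid, so dummy bytes are dropped on the bus. The region bookkeeping and the 0x20-byte line are assumptions for illustration; in the proposed design the TCU computes this in hardware.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE 0x20u  /* assumed cache-line size, as in the earlier example */

/* A registered TC region: [start, start + size) in the address map. */
struct tc_region {
    uintptr_t start;
    size_t    size;
};

static bool in_tc_region(const struct tc_region *r, size_t n, uintptr_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= r[i].start && addr < r[i].start + r[i].size)
            return true;
    return false;
}

/* Build a 32-bit byte enable mask for the cache line at line_base:
 * bit k is 1 iff byte (line_base + k) holds TC data and must be
 * written back; dummy bytes keep 0 and are ignored by the memory. */
uint32_t byte_enable_mask(const struct tc_region *r, size_t n, uintptr_t line_base)
{
    uint32_t mask = 0;
    for (unsigned k = 0; k < CACHE_LINE; k++)
        if (in_tc_region(r, n, line_base + k))
            mask |= 1u << k;
    return mask;
}
```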
V. EXPANSION OF TC CAPABILITY
A. EMBEDDING THE TCU INTO NETWORK-ON-CHIP
To answer the question of where it is best to implement the developed TCU in a multicore platform, we focused on the NoC, which plays the pivotal role of concurrent communication between IPs in the platform. Owing to its ability to overcome the limitations of conventional bus-based system interconnects (e.g., the unbearable increase in density and complexity induced by the interconnect) [27], [30]–[32], the NoC is commonly used in state-of-the-art multicore platforms. FIGURE 12 (a) shows the conventional NoC architecture, in which a processor core in the platform communicates with other IPs only through its dedicated network interface (NI) of the NoC [28], [33]. Since the developed TCU operates independently between the core and the network, embedding it in the NI allows TC to be realized on the platform no matter which cores are used. In addition, as shown in FIGURE 12 (b), placing the TCU inside the NoC does not require modifying the original internal structure of the NI, so a TCU can easily be added to the NI without being tied to a specific NoC. In the end, we propose embedding the TCU in the core-dedicated NI within the NoC as a general solution for TCU implementation.

FIGURE 12. Architectures of (a) the conventional NoC and (b) the proposed NoC with the embedded TCU.

In this paper, we implemented the TCU in our own NoC based on the architecture presented in [27], a compactly designed NoC that supports various types of IP interface conversion and has been silicon-proven in a fabricated SoC. To embed the TCU in the NoC, we first designed the TCU with an advanced peripheral bus (APB) interface for configuring the start address register, last address register, and offset register in the TCU. This APB interface is connected to the NoC as shown in FIGURE 12 (b), so that the core can control the TCU using simple memory read/write operations. Next, we placed the TCU between the core and the existing NI, so that AXI data between the core and the NI must pass through the TCU.
TABLE 4. Comparison of quad-core platforms with different coherency schemes implemented on the FPGA.

To this end, we used the Rocket [35] cores, which fortunately are also offered in a 4-core version with a dedicated CCL alongside the single-core version. The results of a comparative analysis of the hardware-based approach using CCL, the software-based approach that does not allow shared variables to be cached at all, and the proposed TC are reported in TABLE 4 as HW, practical-SW, and TC, respectively. As can be seen from the table, the HW approach presents a very high level of difficulty when the CCL must be developed directly, so in practice only limited platform development is possible, using only the few types of cores that come with a CCL. On the other hand, the practical-SW and TC approaches pose low platform development challenges no matter which core is used to build a multicore platform. Regarding the difficulty of developing applications that run on the developed platform, the TC approach provides an easy-to-use API, but it is still more demanding than the HW or practical-SW approach. Meanwhile, in FPGA prototyping, the hardware resource consumption results for the CCL and TCU show that the HW approach requires more hardware resources than the TC approach. For performance, the HW approach has the shortest application execution time, as expected, but the TC approach comes close, and both far outperform the practical-SW approach. In the end, when developing multicore platforms using different types of RISC-V cores, the proposed TC approach may be the general solution, as it is flexible and easy to apply, and the developed platform delivers good performance.

VII. CONCLUSION
Considering that, when developing multicore platforms using various RISC-V cores, it is difficult to implement a dedicated CCL within each platform, resulting in a serious performance degradation problem, we proposed TC, a method that improves the performance of a multicore platform by enabling caching of data that are definitely not shared for a certain period of time. Through a sophisticated operation mechanism, the proposed TC achieves a performance improvement of the multicore platform while preventing the fatal system errors that can occur when TC data and non-TC data reside on the same cache line. To implement the proposed TC, we developed the TC API for programmers and the TC-dedicated hardware, the TCU, for platform developers, and detailed descriptions of each implementation were provided in this paper. In particular, since the TCU operates independently of the core, a TC-enabled multicore platform can be developed no matter which RISC-V cores are used. In addition, we proposed a method of embedding and implementing the TCU in the NoC of a multicore platform to facilitate the convenience of platform developers. Finally, in order to verify the effectiveness of the proposed TC, we implemented a quad-core platform equipped with the TCU on an FPGA and developed a handwriting recognition application with TC applied as a testbench. Through experimental work, we demonstrated that by applying TC, the performance of a multicore platform can be improved by up to about 37% compared to a platform without TC.

ACKNOWLEDGMENT
(Hyeonguk Jang and Kyuseung Han contributed equally to this work.)

REFERENCES
[1] J. L. Hennessy and D. A. Patterson, ‘‘A new golden age for computer architecture,’’ Commun. ACM, vol. 62, no. 2, pp. 48–60, Jan. 2019.
[2] RISC-V. Accessed: Feb. 23, 2020. [Online]. Available: https://riscv.org/
[3] S. Greengard, ‘‘Will RISC-V revolutionize computing?’’ Commun. ACM, vol. 63, no. 5, pp. 30–32, Apr. 2020.
[4] D. Patterson, ‘‘50 years of computer architecture: From the mainframe CPU to the domain-specific TPU and the open RISC-V instruction set,’’ in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 27–31.
[5] B. Chapman, L. Huang, E. Biscondi, E. Stotzer, A. Shrivastava, and A. Gatherer, ‘‘Implementing OpenMP on a high performance embedded multicore MPSoC,’’ in Proc. IEEE Int. Symp. Parallel Distrib. Process., May 2009, pp. 1–8.
[6] A. C. Sodan, J. Machina, A. Deshmeh, K. Macnaughton, and B. Esbaugh, ‘‘Parallelism via multithreaded and multicore CPUs,’’ Computer, vol. 43, no. 3, pp. 24–32, Mar. 2010.
[7] Y. Kanehagi, D. Umeda, A. Hayashi, K. Kimura, and H. Kasahara, ‘‘Parallelization of automotive engine control software on embedded multicore processor using OSCAR compiler,’’ in Proc. 16th IEEE COOL Chips, Apr. 2013, pp. 1–3.
[8] S. Davidson, S. Xie, C. Torng, K. Al-Hawai, A. Rovinski, T. Ajayi, L. Vega, C. Zhao, R. Zhao, S. Dai, A. Amarnath, B. Veluri, P. Gao, A. Rao, G. Liu, R. K. Gupta, Z. Zhang, R. Dreslinski, C. Batten, and M. B. Taylor, ‘‘The celerity open-source 511-core RISC-V tiered accelerator fabric: Fast architectures and design methodologies for fast chips,’’ IEEE Micro, vol. 38, no. 2, pp. 30–41, Mar./Apr. 2018.
[9] M. Strobel and M. Radetzki, ‘‘Design-time memory subsystem optimization for low-power multi-core embedded systems,’’ in Proc. IEEE 13th Int. Symp. Embedded Multicore/Many-Core Syst.-Chip (MCSoC), Oct. 2019, pp. 347–353.
[10] M. Wang, T. Ta, L. Cheng, and C. Batten, ‘‘Efficiently supporting dynamic task parallelism on heterogeneous cache-coherent systems,’’ in Proc. ACM/IEEE 47th Annu. Int. Symp. Comput. Archit. (ISCA), May 2020, pp. 173–186.
[11] M. Ferdman, P. Lotfi-Kamran, K. Balet, and B. Falsafi, ‘‘Cuckoo directory: A scalable directory for many-core systems,’’ in Proc. IEEE 17th Int. Symp. High Perform. Comput. Archit., Feb. 2011, pp. 169–180.
[12] M. M. K. Martin, M. D. Hill, and D. J. Sorin, ‘‘Why on-chip cache coherence is here to stay,’’ Commun. ACM, vol. 55, no. 7, pp. 78–89, Jul. 2012.
[13] Y. Fu, T. M. Nguyen, and D. Wentzlaff, ‘‘Coherence domain restriction on large scale systems,’’ in Proc. 48th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), New York, NY, USA, Dec. 2015, pp. 686–698.
[14] H. Kim, A. Kandhalu, and R. Rajkumar, ‘‘A coordinated approach for practical OS-level cache management in multi-core real-time systems,’’ in Proc. 25th Euromicro Conf. Real-Time Syst., Jul. 2013, pp. 80–89.
[15] M. Hassan, A. M. Kaushik, and H. Patel, ‘‘Predictable cache coherence for multi-core real-time systems,’’ in Proc. IEEE Real-Time Embedded Technol. Appl. Symp. (RTAS), Apr. 2017, pp. 235–246.
[16] S. Li and D. Guo, ‘‘Cache coherence scheme for HCS-based CMP and its system reliability analysis,’’ IEEE Access, vol. 5, pp. 7205–7215, 2017.
[17] M. Gupta, V. Sridharan, D. Roberts, A. Prodromou, A. Venkat, D. Tullsen, and R. Gupta, ‘‘Reliability-aware data placement for heterogeneous memory architecture,’’ in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2018, pp. 583–595.
[18] A. Ros, M. E. Acacio, and J. M. Garcia, ‘‘DiCo-CMP: Efficient cache coherency in tiled CMP architectures,’’ in Proc. IEEE Int. Symp. Parallel Distrib. Process., Apr. 2008, pp. 1–11.
[19] I.-C. Lin and J.-N. Chiou, ‘‘High-endurance hybrid cache design in CMP architecture with cache partitioning and access-aware policies,’’ IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 10, pp. 2149–2161, Oct. 2015.
[20] U. Milic, A. Rico, P. Carpenter, and A. Ramirez, ‘‘Sharing the instruction cache among lean cores on an asymmetric CMP for HPC applications,’’ in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw. (ISPASS), Apr. 2017, pp. 3–12.
[21] G. G. Shahidi, ‘‘Chip power scaling in recent CMOS technology nodes,’’ IEEE Access, vol. 7, pp. 851–856, 2019.
[22] M. Ansari, M. Pasandideh, J. Saber-Latibari, and A. Ejlali, ‘‘Meeting thermal safe power in fault-tolerant heterogeneous embedded systems,’’ IEEE Embedded Syst. Lett., vol. 12, no. 1, pp. 29–32, Mar. 2020.
[23] S. Chakraborty and H. K. Kapoor, ‘‘Exploring the role of large centralised caches in thermal efficient chip design,’’ ACM Trans. Design Autom. Electron. Syst., vol. 24, no. 5, pp. 1–28, Oct. 2019.
[24] M. Rapp, M. Sagi, A. Pathania, A. Herkersdorf, and J. Henkel, ‘‘Power- and cache-aware task mapping with dynamic power budgeting for many-cores,’’ IEEE Trans. Comput., vol. 69, no. 1, pp. 1–13, Jan. 2020.
[25] B. Choi, R. Komuravelli, H. Sung, R. Smolinski, N. Honarmand, S. V. Adve, V. S. Adve, N. P. Carter, and C.-T. Chou, ‘‘DeNovo: Rethinking the memory hierarchy for disciplined parallelism,’’ in Proc. Int. Conf. Parallel Archit. Compilation Techn., Oct. 2011, pp. 155–166.
[26] J. Cai and A. Shrivastava, ‘‘Software coherence management on non-coherent cache multi-cores,’’ in Proc. 29th Int. Conf. VLSI Design, 15th Int. Conf. Embedded Syst. (VLSID), Jan. 2016, pp. 397–402.
[27] K. Han, J.-J. Lee, and W. Lee, ‘‘Converting interfaces on application-specific network-on-chip,’’ J. Semicond. Technol. Sci., vol. 17, no. 4, pp. 505–513, Aug. 2017.
[28] H. Jang, K. Han, S. Lee, J.-J. Lee, and W. Lee, ‘‘MMNoC: Embedding memory management units into network-on-chip for lightweight embedded systems,’’ IEEE Access, vol. 7, pp. 80011–80019, 2019.
[29] D. Petrisko, F. Gilani, M. Wyse, D. C. Jung, S. Davidson, P. Gao, C. Zhao, Z. Azad, S. Canakci, B. Veluri, T. Guarino, A. Joshi, M. Oskin, and M. B. Taylor, ‘‘BlackParrot: An agile open-source RISC-V multicore for accelerator SoCs,’’ IEEE Micro, vol. 40, no. 4, pp. 93–102, Jul. 2020.
[30] L. Chen, D. Zhu, M. Pedram, and T. M. Pinkston, ‘‘Power punch: Towards non-blocking power-gating of NoC routers,’’ in Proc. HPCA, 2015, pp. 378–389.
[31] K. Han, J.-J. Lee, J. Lee, W. Lee, and M. Pedram, ‘‘TEI-NoC: Optimizing ultralow power NoCs exploiting the temperature effect inversion,’’ IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 2, pp. 458–471, Feb. 2018.
[32] K. Han, S. Lee, J.-J. Lee, W. Lee, and M. Pedram, ‘‘TIP: A temperature effect inversion-aware ultra-low power system-on-chip platform,’’ in Proc. IEEE/ACM Int. Symp. Low Power Electron. Design (ISLPED), Jul. 2019, pp. 1–6.
[33] M. Schoeberl, L. Pezzarossa, and J. Sparsø, ‘‘A minimal network interface for a simple network-on-chip,’’ in Architecture of Computing Systems. Cham, Switzerland: Springer, 2019, pp. 295–307.
[34] K. Han, S. Lee, K.-I. Oh, Y. Bae, H. Jang, J.-J. Lee, W. Lee, and M. Pedram, ‘‘Developing TEI-aware ultralow-power SoC platforms for IoT end nodes,’’ IEEE Internet Things J., vol. 8, no. 6, pp. 4642–4656, Mar. 2021.
[35] K. Asanović et al., ‘‘The rocket chip generator,’’ Dept. EECS, Univ. California, Berkeley, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2016-17, Apr. 2016. [Online]. Available: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-17.html
[36] Xilinx. Vivado 2016.4. Accessed: Feb. 23, 2020. [Online]. Available: https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/vivado-design-tools/2016-4.html
[37] D. Ciresan, U. Meier, and J. Schmidhuber, ‘‘Multi-column deep neural networks for image classification,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3642–3649.
[38] B. Hutchinson, L. Deng, and D. Yu, ‘‘Tensor deep stacking networks,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1944–1957, Aug. 2013.
[39] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, ‘‘Gradient-based learning applied to document recognition,’’ Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[40] M. Abadi et al., ‘‘TensorFlow: A system for large-scale machine learning,’’ in Proc. 12th USENIX Symp. Oper. Syst. Design Implement. (OSDI), Savannah, GA, USA, Nov. 2016, pp. 265–283. [Online]. Available: https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi

HYEONGUK JANG received the B.S. and M.S. degrees in electrical engineering from Gyeongsang National University, Jinju, South Korea, in 2013 and 2015, respectively. He is currently pursuing the Ph.D. degree with the University of Science and Technology. He has been with the SoC Design Research Group, Electronics and Telecommunications Research Institute. His research interests include network-on-chip and system software in embedded systems.

KYUSEUNG HAN (Member, IEEE) received the B.S. and Ph.D. degrees in electrical engineering and computer science from Seoul National University (SNU), Seoul, South Korea, in 2008 and 2013, respectively. At SNU, he researched computer architecture and design automation. Since 2014, he has been working with the Electronics and Telecommunications Research Institute (ETRI), Daejeon, South Korea. He currently belongs to the SoC Design Research Group as a Senior Researcher. His current research interests include reconfigurable architecture, network-on-chip, and ultra-low-power techniques in embedded systems.

SUKHO LEE received the Ph.D. degree in information communications engineering from Chungnam National University, Daejeon, South Korea, in 2010. He is currently a Principal Researcher with the SoC Design Research Group, Electronics and Telecommunications Research Institute, Daejeon. His current research interests include ultra-low-power system-on-chip design, embedded system design, video codec design, and video image processing.

JAE-JIN LEE received the B.S., M.S., and Ph.D. degrees in computer engineering from Chungbuk National University, in 2000, 2003, and 2007, respectively. He is currently a Group Leader with the SoC Design Research Group, Electronics and Telecommunications Research Institute, and also a Professor with the Department of ICT, University of Science and Technology. His research interests include processor and compiler designs in ultra-low-power embedded systems.

JAE-HYOUNG LEE (Student Member, IEEE) received the B.S. degree from Myoungji University, Yong-In, South Korea, in 2020. He is currently pursuing the M.S. degree in electrical and electronics engineering with Chung-Ang University. He is currently a Beneficiary Student of the High-Potential Individuals Global Training Program. His research interests include low-power design, SoC architecture, and embedded systems.