Low-Energy Reduced RISC-V Instruction Subset Processor for Tsetlin Machine Inference at the Edge
Abstract
The Tsetlin Machine (TM) is a logic-based machine learning approach that relies on simple bitwise operations and finite-state automata, which makes it attractive for edge AI deployments. Recent work has focused on TM-based co-processor and accelerator designs. Although these designs achieve high performance, they typically depend on tightly coupled interfaces, microcode-style programming, and external host processors, limiting flexibility and ease of programming. In this work, we present a domain-specific RISC-V microprocessor architecture and design flow tailored for TM inference. Leveraging the modular structure of RISC-V, we design a reduced instruction subset processor that retains programmability while targeting improved performance and lower energy consumption for TM workloads. Instruction profiling is employed to guide instruction reduction, followed by datapath and control path simplifications tailored to TM inference. Both the baseline RV32IM core and the proposed reduced core are evaluated across multiple datasets and compared with Binarized Neural Networks (BNNs), which serve as a hardware-efficient baseline due to their reliance on bitwise operations during inference. Results show that TM achieves comparable or higher accuracy (e.g., up to 88.18% on CIFAR-2 compared to 60.0% for BNN) while reducing execution time by up to 98% across multiple datasets. Furthermore, the proposed design achieves an average energy reduction of 96.63%, demonstrating its effectiveness for programmable and efficient edge AI systems.
I Introduction
The rapid growth of edge computing in domains such as the Internet of Things (IoT), smart sensing, and autonomous systems has created a strong demand for energy-efficient and low-complexity processors capable of performing edge Artificial Intelligence (AI) inference. To meet these requirements, researchers have explored specialized hardware architectures for accelerating machine learning workloads [7, 15]. One common approach is the use of Application-Specific Integrated Circuits (ASICs), which implement fixed-function hardware tailored to specific algorithms [4]. Processors like Apple’s Neural Engine and Google’s Tensor Processing Unit (TPU) have shown high-performance AI inference using ASIC-based designs. However, these designs usually lack the flexibility needed to accommodate changing machine learning models because they are optimized for a limited set of operations. To overcome this limitation, Application-Specific Instruction-set Processors (ASIPs) extend conventional processors with customized instruction sets and specialized datapaths while preserving programmability. Dedicated hardware support for signal processing and low-power computation is offered by embedded systems like the Texas Instruments MSP430 microcontroller and its Low Energy Accelerator (LEA) [15]. Despite these benefits, ASIP-based systems can increase design complexity and require careful hardware-software co-design.
The idea of open hardware has recently gained traction, allowing researchers to create more accessible and customizable CPU architectures [19, 3]. Among these, the open-source RISC-V instruction set architecture (ISA) has gained a lot of interest because of its modular design and support for custom extensions, which allow for the creation of domain-specific processors while maintaining programmability. Neural network models, such as Transformer architectures and Convolutional Neural Networks (CNNs), are the main focus of the majority of current domain-specific processor research [3]. However, these models rely on compute-intensive operations and large memory footprints, resulting in complex processor implementations and high computational costs.
The Tsetlin Machine (TM) has recently emerged as a novel machine learning paradigm based on propositional logic rather than arithmetic computation [6]. Built on Boolean expressions and simple decision rules, TM models have been shown in prior literature to achieve low computational complexity, high interpretability, and significant energy efficiency [5, 14]. These characteristics make TM well-suited for resource-constrained edge AI systems. However, efficiently mapping such algorithms onto programmable processor architectures remains an open challenge. Our hypothesis is that tailoring a programmable processor to TM inference can enable architectures with ultra-low energy consumption, fast operation, and reduced overall system overhead. For comparative analysis, we consider a neural network-based characterization using Binarized Neural Networks (BNNs) [2], which utilize binary weights and activations for inference, and perform comprehensive evaluations.
We make the following key contributions:
• Design of a programmable and high-performance domain-specific RISC-V architecture tailored for Tsetlin Machine inference.
• Extensive validation of the proposed architecture across multiple TM workloads, demonstrating the trade-offs between performance, accuracy, and programmability.
• Comparative analysis with Binarized Neural Networks (BNNs) on RISC-V processors to evaluate performance and energy efficiency.
II Background
This section presents background on the TM model, BNNs and RISC-V for domain-specific architectures.
II-A Tsetlin Machine Model and BNNs
Lightweight machine learning models are increasingly explored for edge intelligence due to their reduced computational requirements. Among these, Binarized Neural Networks (BNNs) and Tsetlin Machines (TMs) have drawn interest as effective substitutes for traditional deep learning models. The Tsetlin Machine (TM) is a logic-based machine learning model that uses propositional logic for pattern recognition tasks [6]. Its core components include Tsetlin Automata (TA) organized into conjunctive clauses, along with summation, threshold, and feedback modules. BNNs are derived from conventional deep neural networks by constraining weights and activations to binary values. They employ multi-layer architectures and backpropagation-based training, replacing costly arithmetic operations with efficient bitwise operations. This significantly reduces computational complexity and memory usage, making BNNs suitable for resource-constrained edge environments. For multi-class classification, we employ the Multiclass TM, where each class is represented by a set of clauses. Since most real-world tasks involve multiple classes, the terms TM and Multiclass TM are used interchangeably in this work. As illustrated in Fig. 1, a Multiclass TM consists of multiple classes, each containing a set of clauses. Each clause is constructed from a group of TAs, where the number of automata equals twice the number of Boolean input features to account for both inputs and their negations. A Booleanizer converts raw input features into binary values and their complements, which serve as inputs to the clause computation module. For each datapoint, the clause computation stage produces a 1-bit output per clause based on the collective decisions of the associated TAs. Each class contains an even number of clauses with alternating positive and negative polarities, indicating whether a clause supports or opposes the classification outcome.
The clause outputs are aggregated through a summation process to compute a class sum for each class, and an argmax operation selects the class with the highest sum as the predicted output. Using straightforward logical procedures, this voting-based system allows for effective and comprehensible classification. The threshold and the specificity parameter are two important hyperparameters in TM. While the specificity establishes the likelihood of using literals in clauses, the threshold regulates the number of clauses that take part in voting.
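The voting step described above can be written compactly. The notation below is introduced here for illustration only and is not necessarily that of Fig. 1: $l_k$ is the $k$-th literal of input $X$, $I_{i,j}$ is the set of literals included in clause $j$ of class $i$, $p_{i,j} \in \{+1, -1\}$ is the clause polarity, $m$ is the number of clauses per class, and $v_i$ is the class sum. During inference, a clause with no included literals is assumed to output 0.

```latex
C_{i,j}(X) = \bigwedge_{k \in I_{i,j}} l_k,
\qquad
v_i = \sum_{j=1}^{m} p_{i,j}\, C_{i,j}(X),
\qquad
\hat{y} = \arg\max_{i} v_i
```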
II-B RISC-V for Domain-Specific Architectures
RISC-V offers an open-source, modular, and extensible platform for processor design, which enables designers to configure cores for domain-specific applications [3]. The RISC-V ISA defines a standardized base instruction set, such as RV32I, that can be extended or reduced. It provides fine-grained control over memory management, execution units, and the processor pipeline to optimize deterministic machine learning workloads. In addition, rapid development and deployment are supported by the established RISC-V toolchains, including open-source cores and GCC-based compilers. Furthermore, the ability to customize both the instruction subset and the microarchitecture allows application-specific workloads to be mapped efficiently onto hardware, which minimizes needless computational cost. Therefore, RISC-V serves as an efficient hardware platform for logic-based, lightweight machine learning models such as the Tsetlin Machine. Scalable edge AI deployments require a balance between programmability and hardware efficiency, which is made possible by such domain-specific customization.
III Proposed Reduced RISC-V Design Flow
This section presents the proposed RISC-V design flow for Tsetlin Machine inference, organized into three stages.
III-A Overview of the Proposed Design Flow
The proposed design flow consists of three stages, as shown in Fig. 2. Stage 1 performs dataset Booleanization and Tsetlin Machine (TM) model generation. Stage 2 focuses on instruction profiling to enable ISA reduction along with initial datapath and control path simplifications. Stage 3 applies further refinement to the datapath and control logic to reduce complexity and improve efficiency for TM inference on the RISC-V processor.
III-B Tsetlin Machine Model Generation
We implemented multiple inference strategies on RISC-V cores for comprehensive evaluation. Two Tsetlin Machine approaches (T1 and T2) were used, differing in clause storage and evaluation, along with a baseline BNN inference for fair comparison due to its similar bitwise operations.
1) Vanilla TM inference (T1): Vanilla TM inference refers to the standard clause evaluation process without any optimization techniques. In this strategy, each clause is evaluated by sequentially checking every literal for inclusion and then comparing it against the input literal vector. After clause evaluation, the vote sum is computed sequentially for each class, and the class with the highest vote sum is selected as the output. The literal states (include/exclude) are stored in memory for each clause as Boolean values. This procedure is formalized in Algorithm 1. The algorithm first iterates over each class and clause, evaluating useful clauses by checking literal inclusion and updating the clause output through logical operations, while non-useful clauses are skipped to avoid unnecessary computation. Subsequently, it computes the vote total for each class by incrementing or decrementing votes based on clause polarity. Finally, the class with the maximum vote value is selected as the predicted output. The inference is implemented in C++, and the RISC-V GNU toolchain [13] is used for cross-compilation.
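As a rough illustration of the T1 strategy, the sketch below evaluates every literal flag of every clause and then selects the class with the most votes. It is not the authors' Algorithm 1: the `DenseTM` struct, its field names, the flat flag layout, and the convention that even-indexed clauses vote for a class and odd-indexed clauses vote against it are all assumptions made for this example. For brevity, vote accumulation is folded into the clause loop rather than done in a separate pass.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for a trained model; names and layout are illustrative.
struct DenseTM {
    std::size_t numClasses;
    std::size_t clausesPerClass;
    std::size_t numLiterals;            // 2 x Boolean features (inputs and negations)
    std::vector<std::uint8_t> include;  // one byte-sized flag per (class, clause, literal)
};

// Returns the index of the class with the highest vote sum.
// literals holds the Booleanized inputs followed by their complements.
inline std::size_t inferDense(const DenseTM& tm,
                              const std::vector<std::uint8_t>& literals) {
    std::size_t best = 0;
    int bestVotes = 0;
    for (std::size_t c = 0; c < tm.numClasses; ++c) {
        int votes = 0;
        for (std::size_t j = 0; j < tm.clausesPerClass; ++j) {
            const std::uint8_t* flags =
                &tm.include[(c * tm.clausesPerClass + j) * tm.numLiterals];
            bool anyIncluded = false;
            std::uint8_t out = 1;
            for (std::size_t k = 0; k < tm.numLiterals; ++k) {
                if (flags[k]) {          // literal k is included in this clause
                    anyIncluded = true;
                    out &= literals[k];  // conjunction of included literals
                }
            }
            if (!anyIncluded) continue;  // clause without included literals casts no vote
            // Assumed convention: even clause index = positive polarity, odd = negative.
            if (out) votes += (j % 2 == 0) ? 1 : -1;
        }
        if (c == 0 || votes > bestVotes) {
            best = c;
            bestVotes = votes;
        }
    }
    return best;
}
```

Every clause costs `numLiterals` flag reads regardless of how many literals it actually includes, which is the overhead the sparse T2 layout targets.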
2) Modified TM inference (T2): To improve efficiency, we implemented a second inference strategy inspired by [11], in which each clause stores only the indices of literals that are actually included, encoded as half-words (16-bit values). Only these indices are accessed and evaluated during inference. This sparse clause structure greatly reduces the number of memory accesses, the memory usage, and the logical comparisons needed for clause evaluation. For example, if only 10 literals are active in a clause containing 128 literals, the vanilla approach still stores 128 inclusion flags, each occupying one byte in the C++ implementation (128 bytes in total). The modified version, on the other hand, stores only ten indices of two bytes each (20 bytes in total), greatly lowering the memory footprint in such cases. Despite this compression, the modified strategy preserves the semantics of clause evaluation and remains functionally equivalent to the vanilla approach. This procedure is formalized in Algorithm 2. The algorithm first iterates over each class and clause, evaluating only the literals whose indices are stored in memory. If a clause contains included literals, the clause output is initialized and updated by sequentially checking the referenced literal values; otherwise, the clause output is set to zero to avoid unnecessary computation. Subsequently, it computes the vote total for each class by adjusting votes based on clause polarity. Finally, the class with the maximum vote value is selected as the predicted output.
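The sparse layout of T2 might look roughly as follows. This is a minimal sketch rather than the authors' Algorithm 2: `SparseClause`, `SparseClass`, the explicit `polarity` field, and the helper names are hypothetical. The two byte-count helpers reproduce the 128-literal example from the text and ignore any per-clause length or terminator overhead a real encoding would need.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sparse clause: only indices of included literals are stored,
// each as a 16-bit half-word.
struct SparseClause {
    std::vector<std::uint16_t> literalIdx;
    int polarity;  // +1 supports the class, -1 opposes it (assumed encoding)
};

using SparseClass = std::vector<SparseClause>;

// 1-bit clause output: AND of the referenced literals, 0 for an empty clause.
inline int evalClause(const SparseClause& cl,
                      const std::vector<std::uint8_t>& literals) {
    if (cl.literalIdx.empty()) return 0;
    std::uint8_t out = 1;
    for (std::uint16_t idx : cl.literalIdx) {
        out &= literals[idx];
    }
    return out;
}

// Returns the index of the class with the highest vote sum.
inline std::size_t inferSparse(const std::vector<SparseClass>& classes,
                               const std::vector<std::uint8_t>& literals) {
    std::size_t best = 0;
    int bestVotes = 0;
    for (std::size_t c = 0; c < classes.size(); ++c) {
        int votes = 0;
        for (const SparseClause& cl : classes[c]) {
            votes += cl.polarity * evalClause(cl, literals);
        }
        if (c == 0 || votes > bestVotes) {
            best = c;
            bestVotes = votes;
        }
    }
    return best;
}

// Per-clause storage in bytes under each layout (index payload only).
constexpr std::size_t denseBytes(std::size_t numLiterals) {
    return numLiterals * sizeof(std::uint8_t);
}
constexpr std::size_t sparseBytes(std::size_t numIncluded) {
    return numIncluded * sizeof(std::uint16_t);
}
static_assert(denseBytes(128) == 128, "128 one-byte inclusion flags");
static_assert(sparseBytes(10) == 20, "10 two-byte literal indices");
```

Under these assumptions the sparse form is smaller whenever fewer than half of a clause's literals are included, since each index costs two bytes against one byte per flag.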
3) Binarized Neural Networks (BNNs): To ensure a consistent and fair comparison between TM and BNN inference, a baseline BNN inference approach was implemented in C++ and cross-compiled in a similar manner. We use BNNs as a hardware-efficient baseline due to their bitwise operation-based inference. In this approach, inference is performed sequentially across layers, where each neuron evaluates the binary outputs from the previous layer against the corresponding binary weights. Matching pairs contribute positively to an accumulation sum, while mismatches reduce it, effectively realizing XNOR-based comparisons followed by integer accumulation. The accumulated value is then evaluated against a zero threshold to determine the neuron activation. This process continues until the final layer produces class scores, and the class with the highest score is selected as the predicted output.
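The neuron computation described above can be sketched as follows. The encoding (bit 1 for +1, bit 0 for -1), the one-byte-per-bit storage, the `Layer` alias, the function names, and the tie rule at exactly zero are assumptions for illustration; the source does not specify them.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// weights[n] holds the binary weights of neuron n, one byte per bit for clarity.
using Layer = std::vector<std::vector<std::uint8_t>>;

// XNOR-style accumulation: a matching pair adds 1, a mismatch subtracts 1.
inline int xnorAccumulate(const std::vector<std::uint8_t>& w,
                          const std::vector<std::uint8_t>& in) {
    int acc = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        acc += (w[i] == in[i]) ? 1 : -1;
    }
    return acc;
}

// Hidden layer: compare each accumulated sum against a zero threshold.
inline std::vector<std::uint8_t> hiddenLayer(const Layer& layer,
                                             const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out;
    out.reserve(layer.size());
    for (const auto& w : layer) {
        // Treating a sum of exactly zero as active is an assumption.
        out.push_back(xnorAccumulate(w, in) >= 0 ? 1 : 0);
    }
    return out;
}

// Final layer: the accumulated sums are class scores; return the arg max.
inline std::size_t outputLayer(const Layer& layer,
                               const std::vector<std::uint8_t>& in) {
    std::size_t best = 0;
    int bestScore = 0;
    for (std::size_t c = 0; c < layer.size(); ++c) {
        const int score = xnorAccumulate(layer[c], in);
        if (c == 0 || score > bestScore) {
            best = c;
            bestScore = score;
        }
    }
    return best;
}
```

A network is then evaluated by chaining `hiddenLayer` calls and finishing with `outputLayer` on the last hidden activation vector.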
III-C Software Compilation Flow and Model Deployment
Tsetlin Machine models were trained using the TMU (Tsetlin Machine Unified) Python library [1]. Multiple datasets and clause configurations were used to evaluate performance across diverse conditions. BNN models were trained using Larq, a TensorFlow framework for BNNs, and model weights were extracted. The extracted clauses and weights were strategically placed in the data-memory (DMEM) alongside the test input before simulation. Inference logic was implemented in C++, cross-compiled using the same RISC-V toolchain, and converted into instruction memory using the same assembly flow. This ensured a consistent and fair comparison between TM and BNN inference in terms of memory structure, control flow, and execution overhead.
III-D Instruction Profiling
Instruction profiling was performed by analyzing the assembly generated from TM inference to identify frequently used instructions. In the absence of function calls in the RISC-V simulation, stack behavior is emulated by initializing the stack pointer and assigning memory addresses to parameters and local variables. Pseudo-instructions are converted into base RISC-V ISA instructions for compatibility with the reduced instruction subset. The conversions are as follows:
(i) li x, imm → addi x, x0, imm
(ii) mv x, y → addi x, y, 0
(iii) beqz x, label → beq x, x0, label
(iv) snez x, y → sltu x, x0, y
(v) zext.b x, y → andi x, y, 0xff
(vi) bge x, y, label → blt y, x, label
(vii) nop → add x0, x0, x0
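A rewriting pass for a subset of these conversions, rules (i) to (v) and (vii), could be sketched as below. The function names and the simple whitespace/comma tokenizer are hypothetical, and the sketch assumes one instruction per line with no comments. Note that rewriting `li` as a single `addi` is only valid when the immediate fits the 12-bit signed range of `addi`; the sketch does not check this.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits "li a0, 5" into {"li", "a0", "5"}.
inline std::vector<std::string> tokenize(const std::string& line) {
    std::string cleaned = line;
    for (char& ch : cleaned) {
        if (ch == ',') ch = ' ';
    }
    std::istringstream iss(cleaned);
    std::vector<std::string> tok;
    for (std::string t; iss >> t;) tok.push_back(t);
    return tok;
}

// Rewrites a pseudo-instruction into a base-ISA instruction; any other line
// is returned unchanged.
inline std::string expandPseudo(const std::string& line) {
    const std::vector<std::string> t = tokenize(line);
    if (t.empty()) return line;
    const std::string& op = t[0];
    // Assumes the immediate fits in 12 signed bits (not verified here).
    if (op == "li" && t.size() == 3) return "addi " + t[1] + ", x0, " + t[2];
    if (op == "mv" && t.size() == 3) return "addi " + t[1] + ", " + t[2] + ", 0";
    if (op == "beqz" && t.size() == 3) return "beq " + t[1] + ", x0, " + t[2];
    if (op == "snez" && t.size() == 3) return "sltu " + t[1] + ", x0, " + t[2];
    if (op == "zext.b" && t.size() == 3) return "andi " + t[1] + ", " + t[2] + ", 0xff";
    if (op == "nop" && t.size() == 1) return "add x0, x0, x0";
    return line;
}
```

For example, `expandPseudo("mv a1, a0")` yields `addi a1, a0, 0`, while a base instruction such as `lw a0, 0(sp)` passes through untouched.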
The impact of reducing high frequency instructions is shown in Fig. 3(a), demonstrating significant instruction count reduction using the modified approach. A subset of instructions contributes most to this reduction. The trend of absolute reduction with increasing clauses is illustrated in Fig. 3(b).
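The profiling step of this section amounts to building a histogram of mnemonics over the generated assembly. The sketch below shows one simple way to do this statically; the function name and the filtering rules (skip labels ending in `:`, directives starting with `.`, and lines starting with `#`) are assumptions, and an instruction that shares a line with a label is not counted.

```cpp
#include <map>
#include <sstream>
#include <string>

// Counts how often each mnemonic appears in an assembly listing,
// assuming one instruction per line.
inline std::map<std::string, int> profileMnemonics(const std::string& asmText) {
    std::map<std::string, int> hist;
    std::istringstream lines(asmText);
    for (std::string line; std::getline(lines, line);) {
        std::istringstream iss(line);
        std::string first;
        if (!(iss >> first)) continue;  // blank line
        if (first[0] == '.' || first[0] == '#' || first.back() == ':') continue;
        ++hist[first];
    }
    return hist;
}
```

Sorting the resulting map by count gives the candidate instructions to keep, while mnemonics that never appear are candidates for removal from the core.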
III-E Domain-Specific Design and Optimization
We utilised two versions of a single-core RISC-V processor from the UltraEmbedded open-source GitHub repository [16]. Both cores were synthesized and simulated using Xilinx Vivado, enabling us to accurately capture instruction execution behavior and analyze resource utilization.
III-E1 Original RISC-V Core (R1)
The original RISC-V core is a full RV32IM implementation, supporting the RV32I base instruction set along with the multiplication/division (M) extension and the Control and Status Register (ZICSR) extension, totaling 59 instructions. It follows a classic 5-stage pipeline (Fetch, Decode, Execute, Memory, Writeback) with support for result forwarding, configurable pipeline depth, and privilege modes (user, supervisor, and machine). The ZICSR extension enables access to control and status registers, while memory can be interfaced via instruction/data caches or tightly coupled memories (TCMs) using an Advanced eXtensible Interface (AXI) bus interface. While Fig. 2 presents the overall three-stage design methodology, Fig. 4 illustrates the implementation workflow and system architecture used for executing TM inference on the RISC-V processor. In our setup, the core was connected to tightly coupled instruction and data memories via the AXI interface. The compiled inference code was loaded into instruction memory, while clause data, input features, and output buffers were placed in data memory, as shown in Fig. 4.
III-E2 Design Modified RISC-V Core (R2)
The modified core is a simplified variant that supports only the instructions required for TM and BNN inference, reducing control complexity and unnecessary hardware. Profiling (Section III-D) identifies frequently used instructions and critical datapath and control paths. Optimization is then focused on these key components, followed by revisiting the design to evaluate improvements in efficiency and performance. To strip down the core, we used the following strategy:
a. remove instruction definitions and flags,
b. prune decode and execution logic for those instructions,
c. simplify control and data paths.
The core supported a total of 27 instructions after reduction.
• Base ISA Instructions (21): andi, addi, xori, slli, add, lui, sltu, or, and, jal, beq, bne, blt, lw, lbu, lhu, sb, sw, ecall, ebreak, eret
• ZICSR Extension Instructions (6): csrrw, csrrs, csrrc, csrrwi, csrrsi, csrrci
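One practical consequence of the reduced subset is that compiled code must be checked against it before deployment. A minimal sketch of such a check is given below; the function names are hypothetical, and the fourth CSR instruction is written as `csrrwi`, the immediate form of `csrrw` in the RISC-V CSR instruction set.

```cpp
#include <set>
#include <string>

// The 27 mnemonics retained in the reduced core (21 base + 6 CSR).
inline const std::set<std::string>& reducedIsa() {
    static const std::set<std::string> isa = {
        "andi", "addi", "xori", "slli", "add", "lui", "sltu", "or", "and",
        "jal", "beq", "bne", "blt", "lw", "lbu", "lhu", "sb", "sw",
        "ecall", "ebreak", "eret",
        "csrrw", "csrrs", "csrrc", "csrrwi", "csrrsi", "csrrci"};
    return isa;
}

// True if the mnemonic can execute on the reduced core.
inline bool isSupported(const std::string& mnemonic) {
    return reducedIsa().count(mnemonic) != 0;
}
```

Running every mnemonic of a disassembled binary through `isSupported` flags instructions such as `mul` or `div` that the compiler may still emit and that would have to be avoided or rewritten.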
IV Experimental Results
The designs were synthesized in Xilinx Vivado targeting the Zynq-7000 ZC702 Evaluation Board, and the on-chip power consumption of both cores was analyzed. The original RISC-V core (R1) exhibited a total on-chip power consumption of , while the modified RISC-V core (R2) consumed . These measurements were obtained under the following environmental and thermal conditions: (i) junction temperature of , (ii) thermal margin of (), and (iii) ambient temperature of .
The post-synthesis timing analysis revealed that the core R1 exhibited a longest path delay of , whereas the modified core (R2) showed a reduced path delay of , resulting in faster operation. Consequently, clock periods of and were used for simulations of R1 and R2, respectively, to ensure a positive Worst Negative Slack (WNS) in both cases. Execution time was measured for all inference scenarios as shown in Fig. 5, where Time (ms) is plotted on a logarithmic scale. While both cores required the same number of clock cycles, their differing clock periods resulted in different execution times, highlighting the performance gain from the reduction. We evaluated inference strategies T1 and T2 on six datasets (Table I) using both the cores R1 and R2. For each dataset, we analyzed the accuracy of the TM model (Table II) across varying numbers of clauses and compared it with the accuracy of a BNN model. The resource utilization of the unmodified core (R1) and the reduced core (R2) was analyzed post-synthesis (Table III). The number of LUTs was reduced by 28.4%, and the number of FFs by 12.6%. The DSP usage saw a 100% reduction due to the removal of the multiplier and divider units in R2. TM inference strategies, T1 (Vanilla TM) and T2 (Modified TM), were simulated on both cores R1 and R2 across all clause configurations. Additionally, BNN inference was executed for each dataset on both cores. Using a hardware-software co-design approach, we modified the TM inference to reduce memory usage and improve execution speed. Subsequently, the RISC-V core was streamlined to support a limited set of instructions tailored for TM inference. This co-optimization enabled a maximum reduction of 99.28% in execution time and an average reduction of 96.55% across all inference scenarios.
| Dataset | Description | #Class | #Features | #Samples |
| CIFAR-2 [9] | 2-class CIFAR animal vs. non-animal | 2 | 324 | 60000 |
| Statlog [12] | 3-D and 2-D vehicle images | 4 | 360 | 846 |
| Gesture [10] | Hand, wrist, head, and spine positions | 5 | 180 | 9901 |
| Gas [17] | Chemical sensor data for 6 gases | 6 | 128 | 13900 |
| EMG [8] | EMG data for static hand gestures | 8 | 160 | 14232 |
| FMNIST [18] | 28×28 grayscale fashion images | 10 | 784 | 70000 |
| Dataset | Clauses per class: 50 | 100 | 150 | 200 | 250 | 300 | BNN |
| CIFAR-2 [9] | 78.78 | 83.5 | 85.9 | 86.95 | 87.75 | 88.18 | 60.0 |
| Statlog [12] | 76.5 | 78.8 | 78.2 | 80.0 | 80.6 | 80.0 | 72.4 |
| Gesture [10] | 64.2 | 68.0 | 71.0 | 73.6 | 75.0 | 75.9 | 76.5 |
| Gas [17] | 72.9 | 85.0 | 85.5 | 86.3 | 86.9 | 86.9 | 85.8 |
| EMG [8] | 80.3 | 84.6 | 85.4 | 86.0 | 86.1 | 85.8 | 83.9 |
| FMNIST [18] | 82.72 | 84.31 | 84.72 | 84.77 | 84.2 | 84.81 | 80.04 |
| Resources | Availability | Usage (R1) | R1% | Usage (R2) | R2% |
| LUT | 53200 | 3493 | 6.57% | 2501 | 4.70% |
| FF | 106400 | 2393 | 2.25% | 2092 | 1.97% |
| DSP | 220 | 4 | 1.82% | 0 | 0% |
We also observed that, for similar accuracy, TM consistently outperforms BNNs in terms of execution time, with the modified TM inference showing a significant performance advantage over BNN. To estimate the energy consumed during inference, we use the relation $E = P \times t$, where $E$ denotes energy (mJ), $P$ represents total on-chip power (W), and $t$ is the execution time (ms). The plots in Fig. 6 illustrate the variation in energy consumption of TM inference with respect to the number of clauses, with Energy (mJ) plotted on a logarithmic scale. The dotted horizontal lines indicate the corresponding energy consumption of BNN inference for comparison. Reducing the instruction set in the core resulted in a power reduction of (from to ), amounting to a 2.18% decrease in hardware power consumption. When combined with the inference algorithm optimization, we observed an average energy reduction of 96.63%, with a maximum reduction of 99.3% across all datasets and clause configurations.
V Conclusions and Future Work
This work addresses the challenge of designing a programmable processor with low energy, fast operation, and reduced overhead for edge AI inference. We presented a domain-specific RISC-V design flow for Tsetlin Machine (TM) inference, guided by instruction profiling and ISA reduction. Results show that TM achieves comparable or higher accuracy than BNNs, with up to 98% lower execution time and 29.7× energy savings. This demonstrates the effectiveness of the reduced programmable RISC-V architecture for edge AI. Future work will explore custom instructions, architectural simplifications, and a design automation flow for RISC-V-based domain-specific processors targeting TM workloads.
References
- [1] (2025) TMU. [Online]. Available: https://github.com/cair/tmu/tree/main [Accessed: Feb. 2026].
- [2] (2016) Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv:1602.02830.
- [3] (2023) RISC-V Instruction Set Architecture Extensions: A Survey. IEEE Access 11, pp. 24696–24711.
- [4] (2024) 1.1 The Deep Learning Revolution and its Implications for Computer Architecture and Chip Design. In Proc. of the ISSCC, pp. 8–14.
- [5] (2025) ETHEREAL: Energy-efficient and High-throughput Inference using Compressed Tsetlin Machine. In Proc. of the IWASI, pp. 1–6.
- [6] (2021) The Tsetlin Machine – A Game Theoretic Bandit Driven Approach to Optimal Pattern Recognition with Propositional Logic. arXiv:1804.01508.
- [7] (2024) Hardware Accelerator for MobileViT Vision Transformer with Reconfigurable Computation. In Proc. of the ISCAS, pp. 1–4.
- [8] (2019) EMG Data for Gestures. UCI Machine Learning Repository. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/EMG+data+for+gestures [Accessed: Feb. 2026].
- [9] (2017) CIFAR-2. [Online]. Available: https://keras.io/api/datasets/cifar10/ [Accessed: Feb. 2026].
- [10] (2014) Gesture Phase Segmentation. UCI Machine Learning Repository. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Gesture+Phase+Segmentation [Accessed: Feb. 2026].
- [11] (2023) REDRESS: Generating Compressed Models for Edge Inference Using Tsetlin Machines. IEEE TPAMI 45 (9), pp. 11152–11168.
- [12] (2023) Statlog (Vehicle Silhouettes). UCI Machine Learning Repository. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Statlog+(Vehicle+Silhouettes) [Accessed: Feb. 2026].
- [13] (2025) RISC-V GNU Toolchain. [Online]. Available: https://github.com/riscv-collab/riscv-gnu-toolchain [Accessed: Feb. 2026].
- [14] (2026) Learning Dynamics, Pattern Recognition Capability and Interpretability of the Tsetlin Machine. Pattern Recognition 174, pp. 113028.
- [15] (2025) Tsetlin Machine-Based Image Classification FPGA Accelerator With On-Device Training. IEEE Transactions on Circuits and Systems I: Regular Papers 72 (2), pp. 830–843.
- [16] (2025) RISC-V. [Online]. Available: https://github.com/ultraembedded/riscv [Accessed: Feb. 2026].
- [17] (2012) Gas Sensor Array Drift Dataset. UCI Machine Learning Repository. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Gas+Sensor+Array+Drift+Dataset [Accessed: Feb. 2026].
- [18] (2017) Fashion-MNIST. Keras. [Online]. Available: https://keras.io/api/datasets/fashion_mnist/ [Accessed: Feb. 2026].
- [19] (2025) Fast and Compact Tsetlin Machine Inference on CPUs Using Instruction-Level Optimization. In Proc. of the ISTM, pp. 44–47.