Want a faster ML processor? Do it yourself! -- A framework for playing with custom opcodes to accelerate TensorFlow Lite for Microcontrollers (TFLM).

Overview

CFU Playground

Want a faster ML processor? Do it yourself!

This project provides a framework that an engineer, intern, or student can use to design and evaluate enhancements to an FPGA-based “soft” processor, specifically to increase the performance of machine learning (ML) tasks. The goal is to abstract away most infrastructure details so that the user can get up to speed quickly and focus solely on adding new processor instructions, exploiting them in the computation, and measuring the results.

This project enables rapid iteration on processor improvements -- multiple iterations per day.

This is how it works:

  • Choose a TensorFlow Lite model; a quantized person detection model is provided, or bring your own.
  • Execute the inference on the Arty FPGA board to get cycle counts per layer.
  • Choose a TFLite operator to accelerate, and dig into that code.
  • Design new instruction(s) that can replace multiple basic operations.
  • Build a custom function unit (a small amount of hardware) that performs the new instruction(s).
  • Modify the TFLite/Micro library kernel to use the new instruction(s), which are available as intrinsics with function call syntax (see the sketch after this list).
  • Rebuild the FPGA SoC, recompile the TFLM library, and rerun to measure the improvement.
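
As an illustration of the last two steps, a kernel's inner loop can invoke a CFU instruction through the cfu_op* intrinsics. The sketch below is hypothetical, not code from this repo: the funct value, the packed int8 multiply-accumulate semantics, and the helper function are assumptions. It only shows the shape of the change, where several C++ operations collapse into one custom instruction.

    #include <cstdint>
    #include <cstring>
    #include "cfu.h"  // cfu_op* intrinsics provided in a CFU Playground project (path assumed)

    // Dot product over int8 data; depth is assumed to be a multiple of 4.
    // Hypothetical CFU: funct 0 performs a 4-way signed multiply-accumulate
    // on two 32-bit operands, each packing four int8 values.
    int32_t AccumulateWithCfu(const int8_t* input, const int8_t* filter, int depth) {
      int32_t acc = 0;
      for (int d = 0; d < depth; d += 4) {
        uint32_t in_word, filt_word;
        std::memcpy(&in_word, input + d, 4);    // pack four int8 values per operand
        std::memcpy(&filt_word, filter + d, 4);
        // Software version: for (int i = 0; i < 4; ++i) acc += input[d + i] * filter[d + i];
        // CFU version: one custom instruction does the 4-way multiply-accumulate.
        acc += static_cast<int32_t>(cfu_op0(/* funct */ 0, in_word, filt_word));
      }
      return acc;
    }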

The focus here is performance, not demos. The inputs to the ML inference are canned/faked, and the only output is cycle counts. It would be possible to export the improvements made here to an actual demo, but currently no pathway is set up for doing so.

With the exception of Vivado, everything used by this project is open source.

Disclaimer: This is not an officially supported Google project. Support and/or new releases may be limited.

This is an early prototype of an ML exploration framework; expect a lack of documentation and occasional breakage. If you want to collaborate on building out this framework, reach out to [email protected]! See "Contribution guidelines" below.

Required hardware/OS

  • Currently, the only supported target is the Arty 35T board from Digilent.
  • The only supported host OS is Linux (Debian / Ubuntu).

If you want to try things out using Renode simulation, then you don't need either the Arty board or Vivado software. You can also perform Verilog-level cycle-accurate simulation with Verilator, but this is much slower.

Assumed software

  • Vivado must be manually installed.

Other required packages will be checked for and, if on a Debian-based system, automatically installed by the setup script below.

Setup

Clone this repo, cd into it, then run:

scripts/setup

Use

Build the SoC and load the bitstream onto Arty:

cd proj/proj_template
make prog

This builds the SoC with the default CFU from proj/proj_template. Later you'll copy this and modify it to make your own project.

Build a RISC-V program and execute it on the SoC that you just loaded onto the Arty:

make load

To use Renode to execute on a simulator on the host machine (no Vivado or Arty board required), execute:

make renode

To use Verilator to execute on a cycle-accurate RTL-level simulator (no Vivado or Arty board required), execute:

make PLATFORM=sim load

Underlying open-source technology

  • LiteX: Open-source framework for assembling the SoC (CPU + peripherals)
  • VexRiscv: Open-source RISC-V soft CPU optimized for FPGAs
  • nMigen: Python toolbox for building digital hardware

Licensed under the Apache-2.0 license

See the file LICENSE.

Contribution guidelines

If you want to contribute to CFU Playground, be sure to review the contribution guidelines. This project adheres to Google's code of conduct. By participating, you are expected to uphold this code.

Comments
  • proj_template/ make prog doesn't select which python3 it needs

    I've run sauron:~/fpga/CFU-Playground$ ./scripts/setup and it worked fine.

    Then

    sauron:~/fpga/CFU-Playground/proj/proj_template$ make prog
    (...)
    INFO:SoC:IRQ Handler (up to 32 Locations).
    IRQ Locations: (2)
    - uart   : 0
    - timer0 : 1
    INFO:SoC:--------------------------------------------------------------------------------
    Traceback (most recent call last):
      File "./common_soc.py", line 54, in <module>
        main()
      File "./common_soc.py", line 50, in main
        workflow.run()
      File "/home/merlin/fpga/CFU-Playground/soc/board_specific_workflows/general.py", line 114, in run
        self.load(soc, soc_builder)
      File "/home/merlin/fpga/CFU-Playground/soc/board_specific_workflows/general.py", line 103, in load
        prog.load_bitstream(bitstream_filename)
      File "/home/merlin/fpga/CFU-Playground/third_party/python/litex/litex/build/openocd.py", line 21, in load_bitstream
        config = self.find_config()
      File "/home/merlin/fpga/CFU-Playground/third_party/python/litex/litex/build/generic_programmer.py", line 72, in find_config
        import requests
    ModuleNotFoundError: No module named 'requests'
    sauron:~/fpga/CFU-Playground/proj/proj_template$ type python3
    python3 is hashed (/opt/symbiflow/xc7/conda/envs/xc7/bin/python3)
    

    ../../soc/common_soc.py runs env python3 but I found nothing in https://cfu-playground.readthedocs.io/en/latest/setup-guide.html that selects/chooses which python3 should be run.

    The system python3 works worse

        raise YosysError("Could not find an acceptable Yosys binary. The `nmigen-yosys` PyPI "
    nmigen._toolchain.yosys.YosysError: Could not find an acceptable Yosys binary. The `nmigen-yosys` PyPI package, if available for this platform, can be used as fallback
    

    and the fomu python3 doesn't work either

    sauron:~/fpga/CFU-Playground/proj/proj_template$ type python3
    python3 is /home/merlin/fpga/fomu-toolchain-linux/bin/python3
    sauron:~/fpga/CFU-Playground/proj/proj_template$ make prog
    /home/merlin/fpga/CFU-Playground/scripts/pyrun /home/merlin/fpga/CFU-Playground/proj/proj_template/cfu_gen.py
    Traceback (most recent call last):
      File "/home/merlin/fpga/CFU-Playground/proj/proj_template/cfu_gen.py", line 38, in <module>
        main()
      File "/home/merlin/fpga/CFU-Playground/proj/proj_template/cfu_gen.py", line 31, in main
        new_verilog = verilog.convert(cfu, name='Cfu', ports=cfu.ports)
      File "/home/merlin/fpga/CFU-Playground/third_party/python/nmigen/nmigen/back/verilog.py", line 61, in convert
        return _convert_rtlil_text(rtlil_text, strip_internal_attrs=strip_internal_attrs)
      File "/home/merlin/fpga/CFU-Playground/third_party/python/nmigen/nmigen/back/verilog.py", line 10, in _convert_rtlil_text
        yosys = find_yosys(lambda ver: ver >= (0, 9))
      File "/home/merlin/fpga/CFU-Playground/third_party/python/nmigen/nmigen/_toolchain/yosys.py", line 228, in find_yosys
        raise YosysError("Could not find an acceptable Yosys binary. The `nmigen-yosys` PyPI "
    nmigen._toolchain.yosys.YosysError: Could not find an acceptable Yosys binary. The `nmigen-yosys` PyPI package, if available for this platform, can be used as fallback
    make: *** [../proj.mk:187: generate_cfu] Error 1
    
    opened by marcmerlin 24
  • Verilator compile in Conda env can't find the compiler

    To reproduce: go to this colab: https://colab.research.google.com/drive/1_GlX-pO4rune8GMIK4q_IuhxpIhnPTni?resourcekey=0-jMbF8wzg0csZu0fxpOy78A&usp=sharing (currently only readable within Google)

    Then select "Runtime" --> "Run all"

    Output:

    make -j -C /content/CFU-Playground/soc/build/sim.proj_template_v/gateware/obj_dir -f Vsim.mk Vsim
    make[3]: Entering directory '/content/CFU-Playground/soc/build/sim.proj_template_v/gateware'
    make[3]: warning: jobserver unavailable: using -j1.  Add '+' to parent make rule.
    Vsim.mk:13: warning: NUL character seen; rest of line ignored
    ccache x86_64-conda-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /content/CFU-Playground/env/conda/envs/cfu-common/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /content/CFU-Playground/env/conda/envs/cfu-common/include -I.  -MMD -I/content/CFU-Playground/env/conda/envs/cfu-common/share/verilator/include -I/content/CFU-Playground/env/conda/envs/cfu-common/share/verilator/include/vltstd -DVM_COVERAGE=0 -DVM_SC=0 -DVM_TRACE=1 -DVM_TRACE_FST=1 -faligned-new -fcf-protection=none -Wno-bool-operation -Wno-sign-compare -Wno-uninitialized -Wno-unused-but-set-variable -Wno-unused-parameter -Wno-unused-variable -Wno-shadow     -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /content/CFU-Playground/env/conda/envs/cfu-common/include -ggdb -Wall -O0   -DTRACE_FST -I/content/CFU-Playground/third_party/python/litex/litex/build/sim/core  -std=gnu++14 -Os -c -o veril.o /content/CFU-Playground/third_party/python/litex/litex/build/sim/core/veril.cpp
    ccache: error: Could not find compiler "x86_64-conda-linux-gnu-c++" in PATH
    Vsim.mk:71: recipe for target 'veril.o' failed
    make[3]: *** [veril.o] Error 1
    make[3]: Leaving directory '/content/CFU-Playground/soc/build/sim.proj_template_v/gateware/obj_dir'
    /content/CFU-Playground/third_party/python/litex/litex/build/sim/core/Makefile:38: recipe for target 'sim' failed
    make[2]: *** [sim] Error 2
    make[2]: *** Waiting for unfinished jobs....
    make[4]: Leaving directory '/content/CFU-Playground/soc/build/sim.proj_template_v/gateware/modules/gmii_ethernet'
    cp gmii_ethernet/gmii_ethernet.so gmii_ethernet.so
    make[3]: Leaving directory '/content/CFU-Playground/soc/build/sim.proj_template_v/gateware/modules'
    make[2]: Leaving directory '/content/CFU-Playground/soc/build/sim.proj_template_v/gateware'
    /content/CFU-Playground/soc/sim.mk:56: recipe for target 'run' failed
    make[1]: *** [run] Error 1
    make[1]: Leaving directory '/content/CFU-Playground/soc'
    ../proj.mk:354: recipe for target 'load' failed
    make: *** [load] Error 2
    

    To highlight the error:

    ccache: error: Could not find compiler "x86_64-conda-linux-gnu-c++" in PATH
    

    Looking under env/conda/envs/cfu-common/bin/, I see:

    /content/CFU-Playground/env/conda/bin/x86_64-conda-linux-gnu-ld
    /content/CFU-Playground/env/conda/bin/x86_64-conda_cos6-linux-gnu-ld
    

    and nothing else with an x86_64 prefix.

    opened by tcal-x 23
  • Add support for verilated CFU in Renode

    This is related to #11.

    What's new:

    1. A new subdirectory common/renode-verilator-integration, which includes CMakeLists.txt, sim_main.cpp and renode_h.patch. These files are used to build the verilated CFU library. Since they are adjusted to work as a Renode plugin, renode_h.patch changes the include path in renode.h to match our custom structure (we download a few files, see 2.).
    2. VerilatorIntegrationLibrary - the files that are part of the library are downloaded to third_party/renode/verilator-integration-library. These files are the minimum needed to build the CFU library, and they must be downloaded since they are not present in the Renode portable package. As soon as Renode includes them in the portable version we will get rid of this workaround.
    3. Renode scripts now generate Verilated.CFUVerilatedPeripheral with the settings required to run it. It is also appended to the predefined scripts, so new scripts should not contain a CFU themselves. Our scripts (proj.mk with generate_renode_scripts.py) take care of whether the CFU should be included in the Renode scripts (if the flag SW_ONLY=1 is passed, the CFU won't be added).
    4. Projects added to CI are now tested without any additional build parameters by default. You can add different build variants by adding ci_build_params.txt.{0-9}, but this won't turn off the default tests. This is done for mnv2_first right now.
    5. example_cfu is added to the CI workflow.
    opened by robertszczepanski 20
  • Enable QPI mode for Crosslink-NX Evaluation board

    This PR enables the use of QPI mode on the CNX EVN board. It depends on the changes made in litex-hub/litespi#53 and in enjoy-digital/litex#979. Since we're switching to the flashboot process that LiteX provides, an additional step of using the MiSoC image file writer (mkmscimg) is required before flashing the software, to insert the length and CRC32:

    python <path_to_litex>/litex/soc/software/mkmscimg.py -o output.bin --little --fbi <path_to_cfu_proj>/build/software.bin
    

    PS I've noticed that this repo is not using the upstream version of LiteX and instead is using the fork that @tcal-x provides. What is the reason behind this and are there plans to upstream these changes?

    opened by fkokosinski 18
  • Enable running benchmarks on CrossLink NX Evaluation board

    Currently it is not possible to run benchmarks (or any other piece of software) from this repository on the CNX EVN board. This PR adds a new platform (based on the existing hps platform) that enables this by using the on-board SPI flash chip. Building gateware/software (in a project directory, e.g. proj/proj_template):

    UART_SPEED=115200 PLATFORM=cnx_evn TARGET=lattice_crosslink_nx_evn make bitstream software
    

    Flashing/uploading using ecpprog (in project directory, e.g. proj/proj_template):

    ecpprog -so 2097152 build/software.bin
    ecpprog -S ../../soc/build/cnx_evn.proj_template/gateware/cnx_evn_platform.bit
    
    opened by fkokosinski 18
  • Symbiflow shows Unable to detect GNU compiler type

    Hi, I am facing an issue while setting up the environment for CFU Playground.

    While running make prog TARGET=digilent_arty USE_SYMBIFLOW=1

    I am getting this:

    meson /home/mamuneeb/CFU-Playground/third_party/python/pythondata-software-picolibc/pythondata_software_picolibc/data -Dmultilib=false -Dpicocrt=false -Datomic-ungetc=false -Dthread-local-storage=false -Dio-long-long=true -Dformat-default=integer -Dincludedir=picolibc/riscv64-unknown-elf/include -Dlibdir=picolibc/riscv64-unknown-elf/lib --cross-file cross.txt
    WARNING: Unknown CPU family riscv, please report this at https://github.com/mesonbuild/meson/issues/new
    The Meson build system
    Version: 0.63.99
    Source dir: /home/mamuneeb/CFU-Playground/third_party/python/pythondata-software-picolibc/pythondata_software_picolibc/data
    Build dir: /home/mamuneeb/CFU-Playground/soc/build/digilent_arty.proj_template/software/libc
    Build type: cross build
    Project name: picolibc
    Project version: 1.7.7

    ../../../../../third_party/python/pythondata-software-picolibc/pythondata_software_picolibc/data/meson.build:35:0: ERROR: Unable to detect GNU compiler type: Compiler stdout:

    Compiler stderr: riscv64-unknown-elf-gcc: fatal error: cannot execute 'cc1': execvp: No such file or directory compilation terminated.

    A full log can be found at /home/mamuneeb/CFU-Playground/soc/build/digilent_arty.proj_template/software/libc/meson-logs/meson-log.txt

    make[6]: *** [/home/mamuneeb/CFU-Playground/third_party/python/litex/litex/soc/software/libc/Makefile:43: __libc.a] Error 1
    make[6]: Leaving directory '/home/mamuneeb/CFU-Playground/soc/build/digilent_arty.proj_template/software/libc'
    Traceback (most recent call last):
      File "./common_soc.py", line 57, in <module>
        main()
      File "./common_soc.py", line 53, in main
        workflow.run()
      File "/home/mamuneeb/CFU-Playground/soc/board_specific_workflows/general.py", line 125, in run
        soc_builder = self.build_soc(soc)
      File "/home/mamuneeb/CFU-Playground/soc/board_specific_workflows/digilent_arty.py", line 73, in build_soc
        return super().build_soc(soc, **kwargs)
      File "/home/mamuneeb/CFU-Playground/soc/board_specific_workflows/general.py", line 102, in build_soc
        soc_builder.build(run=self.args.build, **kwargs)
      File "/home/mamuneeb/CFU-Playground/third_party/python/litex/litex/soc/integration/builder.py", line 344, in build
        self._generate_rom_software(compile_bios=use_bios)
      File "/home/mamuneeb/CFU-Playground/third_party/python/litex/litex/soc/integration/builder.py", line 281, in _generate_rom_software
        subprocess.check_call(["make", "-C", dst_dir, "-f", makefile])
      File "/home/mamuneeb/CFU-Playground/env/conda/envs/cfu-symbiflow/lib/python3.7/subprocess.py", line 363, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['make', '-C', '/home/mamuneeb/CFU-Playground/soc/build/digilent_arty.proj_template/software/libc', '-f', '/home/mamuneeb/CFU-Playground/third_party/python/litex/litex/soc/software/libc/Makefile']' returned non-zero exit status 2.
    make[5]: *** [/home/mamuneeb/CFU-Playground/soc/common_soc.mk:115: build/digilent_arty.proj_template/gateware/digilent_arty.bit] Error 1
    make[5]: Leaving directory '/home/mamuneeb/CFU-Playground/soc'
    make[4]: *** [../proj.mk:312: prog] Error 2
    make[4]: Leaving directory '/home/mamuneeb/CFU-Playground/proj/proj_template'

    It says unable to detect GNU compiler type. Please help.

    opened by mamuneeb 17
  • Dynamic clock control between CPU and CFU

    After the tests done in #514, we want to use the ability to disable/enable clocks to minimize power consumption during inference.

    This PR introduces dynamic clock control between the CPU and the CFU. When the CPU receives confirmation that a command has been accepted (cmd.ready asserted), its clock is disabled until the CFU is ready to respond (rsp.valid asserted). Then the CPU's clock is enabled and the CFU's clock is disabled.

    Right now this is a draft: even though clock control between the SoC and the CFU works, for some reason the CFU hangs after receiving a command from the CPU. Maybe it requires a few additional cycles for setup at system boot?

    There is also a weird anomaly: after boot, power consumption is about 44-45 mW (~38 mW with the CFU clock disabled), but after running some tests (e.g. 1: TfLM Models menu -> 1: HPS models -> g: Golden tests (check for expected outputs)) power consumption rises to around 54 mW and stays at that level after each operation (it drops to 42-44 mW during other executions of CFU functions, e.g. the HPS golden tests, but rises again after they finish). It falls back to 44 mW after asserting the reset signal in the CFU clock domain.

    opened by robertszczepanski 17
  • Use dcache metrics CSRs in benchmarks

    This PR introduces displaying additional dcache-related metrics from the CPU's CSRs in benchmarks:

    Running sequential stores benchmark
    Hello, Store!
    Val:28  Cycles: 11389084   Cycles/store: 10
    [Dcache accesses] Before: 3223075888, After: 3224127310, Diff: 1051422
    [Dcache refills]  Before: 952539, After: 952545, Diff: 6
    [Dcache stalls]   Before: 89447457, After: 96635863, Diff: 7188406
    
    opened by fkokosinski 16
  • hps_accel: max pool fails golden tests

    hps_accel is exhibiting strange behavior.

    To reproduce:

    1. Use code at PR #450
    2. Build and run on NX/17
    3. From the menu, select 1 (TfLM models), 1 (HPS models), 3 (presence model), g (golden tests) - tests fail
    4. Build and run in simulator (make load PLATFORM=sim) - tests pass.
    opened by alanvgreen 14
  • `mnv2_first` CFU timeout in Renode

    The CFU has recently been added to Renode and I am working on using it in CI in CFU-Playground. example_cfu_v seems to work fine, but I've encountered a problem with the mnv2_first project.

    The verilated peripheral is built using the Renode Verilator Integration repository and the Verilator Integration Library inside Renode's VerilatorPlugin. This generates a library that is then bound to Renode by adding Verilated.CFUVerilatedPeripheral to a platform. After that, whenever an opcode matches the CFU pattern, the CFU is executed in Renode, which calls a function from the verilated peripheral library.

    I've already noticed that the functionID is incorrectly retrieved from the instruction pattern, and there is a fix about to be merged soon; it will merge funct7 with funct3 as (funct7 << 3) + funct3. The problem is that everything works fine for example_cfu_v but not for mnv2_first. For the mnv2 project, every CFU execution ends with an Operation timeout from our CFU, which means that rsp_valid is never set to 1. I also tried checking rsp_payload_response_ok instead of rsp_valid; then there is no timeout, but all tests fail anyway, so I'm not sure that's a good approach.

    Maybe the CFU expects something more to be set by the CPU in the case of mnv2_first, and it's not properly handled in execute()? If you have any ideas, I would be grateful for help :)

    FYI @tcal-x

    opened by robertszczepanski 14
  • add framebuffer enable related settings and utility functions

    Hi,

    In this pull request, I add some framebuffer support to enable the framebuffer function for testing.

    1. Add framebuffer enable related settings in board_specific_workflows and common_soc.mk.
    2. Add simple framebuffer utility functions, including draw line, draw box, draw string, and fill box.
    3. Add the SPI card enable setting, but real testing has not started yet.

    To enable the framebuffer output function, there are some add-on modules that could be used:

    1. VGA version: please check the following URLs. https://raspi.tv/2014/gert-vga-666-review-and-video https://hackaday.com/2016/02/21/5-vga-for-raspberry-pi/

    2. HDMI version: use an FPGA board with HDMI output, such as the QMTECH Wukong board or the Nexys Video. https://reference.digilentinc.com/learn/programmable-logic/tutorials/nexys-video-hdmi-demo/start

    BR, Akio

    opened by akioolin 14
  • Run Serv in CFU Playground

    This PR adds basic, necessary changes to build & run the baseline Serv implementation (i.e. no MDU, no CFU) in CFU Playground. This is the first addition/use of another CPU in the Playground :)

    Here is a test to give Serv a spin. Add the following to the end of the Makefile in proj/proj_template:

    DEFINES += INCLUDE_EMBENCH_PRIMECOUNT
    export SERV=1
    DEFINES += USE_LITEX_TIMER
    

    Then run:

    make prog EXTRA_LITEX_ARGS="--cpu-type=serv --cpu-variant=standard"
    make load EXTRA_LITEX_ARGS="--cpu-type=serv --cpu-variant=standard"
    

    Try to run the primecount workload from the Embench benchmark suite and see the following output running on Arty:

    CFU Playground
    ==============
     2: Functional CFU Tests
     3: Project menu
     4: Performance Counter Tests
     6: Benchmarks
     7: Util Tests
     8: Embench IoT
    main> 8
    
    Running Embench IoT
    
    Embench IoT
    ===========
     1: primecount
     x: eXit to previous menu
    benchmarks> 1
    
    Running primecount
    OK  Benchmark result verified
    Spent 337298843 cycles
    ---
    

    You can also run the TFLM Unit/Golden tests, and they should all pass. Note that because Serv does not implement mcycle, the built-in LiteX timer must be used for profiling cycle counts. Functionality to call this timer has been added in the PR, and its use is demonstrated in the Embench workloads.

    opened by ShvetankPrakash 0
  • Single Cycle CFU instruction issue with Vexriscv

    Hi @tcal-x and @mithro, I'm experiencing issues with the VexRiscv core, its CFU bus, and the execution of single-cycle CFU instructions.

    Take the case below, where I have a bunch of single-cycle CFU instructions running in a loop.

    for (...) {
        // Based on flags, add offset or put 0
        cfu_op1(2, 0, 0);
        // Products
        cfu_op1(3, 0, 0);
        // Adding prods
        cfu_op1(4, 0, 0);
        // Store back and add into accum buffer
        cfu_op1(5, 0, 0);
    }
    

    What I'm seeing is that some of these instructions, say cfu_op1(3,0,0), are executing twice, even though I didn't intend that. I have experienced this before with the multi-cycle CFU interface with the VexRiscv core; however, I assumed that the single-cycle CFU interface plus the VexRiscv core would work. Should I pull the latest version of the g-cfu repository? Have you come across this and resolved this bug, or is it still present in the VexRiscv core? Please suggest alternative methods to resolve this as well.

    This is the CFU single-cycle interface I'm using:

      // Trivial handshaking for a combinational CFU
      assign rsp_valid = cmd_valid;
      assign cmd_ready = rsp_ready;
    

    Thanks, Bala.

    opened by bala122 4
  • accuracy of function perf_get_mcycle64()

    Hello! I have some questions about the function perf_get_mcycle64(). There is some difference in the total cycles when I use perf_counters compared to when I do not. The following shows the result without perf_counters when running the KWS model.

    Running kws
    .............
    "Event","Tag","Ticks"
    0,CONV_2D,18426
    1,DEPTHWISE_CONV_2D,4340
    2,CONV_2D,12430
    3,DEPTHWISE_CONV_2D,4290
    4,CONV_2D,11224
    5,DEPTHWISE_CONV_2D,4094
    6,CONV_2D,12518
    7,DEPTHWISE_CONV_2D,4397
    8,CONV_2D,11369
    9,AVERAGE_POOL_2D,90
    10,RESHAPE,2
    11,FULLY_CONNECTED,18
    12,SOFTMAX,12
     Counter |  Total | Starts | Average |     Raw
    ---------+--------+--------+---------+--------------
        0    |     0  |     0  |   n/a   |            0
        1    |     0  |     0  |   n/a   |            0
        2    |     0  |     0  |   n/a   |            0
        3    |     0  |     0  |   n/a   |            0
        4    |     0  |     0  |   n/a   |            0
        5    |     0  |     0  |   n/a   |            0
        6    |     0  |     0  |   n/a   |            0
        7    |     0  |     0  |   n/a   |            0
        85M (     85231808 )  cycles total
    

    However, if I add perf_enable_counter and perf_disable_counter in conv.h to use some perf_counters, the total cycles change.

    Running kws
    .............
    "Event","Tag","Ticks"
    0,CONV_2D,21499
    1,DEPTHWISE_CONV_2D,4239
    2,CONV_2D,9217
    3,DEPTHWISE_CONV_2D,4193
    4,CONV_2D,8028
    5,DEPTHWISE_CONV_2D,3992
    6,CONV_2D,8771
    7,DEPTHWISE_CONV_2D,4709
    8,CONV_2D,8350
    9,AVERAGE_POOL_2D,90
    10,RESHAPE,2
    11,FULLY_CONNECTED,18
    12,SOFTMAX,14
     Counter |  Total | Starts | Average |     Raw
    ---------+--------+--------+---------+--------------
        0    |    57M |     5  |    11M  |     57177105
        1    |    33M | 302720  |   108   |     32978671
        2    |     0  |     0  |   n/a   |            0
        3    |     0  |     0  |   n/a   |            0
        4    |     0  |     0  |   n/a   |            0
        5    |     0  |     0  |   n/a   |            0
        6    |     0  |     0  |   n/a   |            0
        7    |     0  |     0  |   n/a   |            0
        75M (     74897487 )  cycles total
    

    It's strange that the total cycle count drops so much; I don't think it's due to error, because the difference is so large. I understand the total cycles are counted using the function perf_get_mcycle64(), which I think should not be affected by whether perf_counters are used. Thanks in advance!

    opened by limingxuan-pku 3
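
For context on the issue above: the perf counter API it refers to is used by bracketing the code region of interest with enable/disable calls. The snippet below is a minimal sketch, not code from the repository; the include path, counter index, and measured loop are illustrative assumptions.

    #include <cstdint>
    #include "perf.h"  // perf_enable_counter / perf_disable_counter helpers (path assumed)

    // Measure only the inner loop on counter 0; its Total/Starts then appear
    // in the counter table printed by the benchmark output.
    int32_t MeasuredDotProduct(const int8_t* a, const int8_t* b, int n) {
      perf_enable_counter(0);   // start accumulating cycles on counter 0
      int32_t acc = 0;
      for (int i = 0; i < n; ++i) {
        acc += a[i] * b[i];
      }
      perf_disable_counter(0);  // stop accumulating
      return acc;
    }

The enable/disable calls themselves execute a few extra instructions, so counters are normally placed around a region of interest rather than inside the tightest inner loops.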