Ghazi Bouselmi's Pages
Here you can find out who I am. Well, also my CV, my publications
LinkedIn : http://www.linkedin.com/in/ghazibouselmi
Here are some of the main programs I have developed (the content is still being populated, I'm looking into my archives :) )
Please be lenient, these projects have been carried out as a hobby, and (for the most recent) after long working days !
High-Throughput C++ Market By Order (MBO) Data System (November 2025)
This project involved the design and implementation of a high-performance, multithreaded client-server application in C++. The server streams real-time Market By Order (MBO) data to a client, which is responsible for reconstructing and maintaining the order book in memory with minimal latency.
The system was engineered to meet the stringent performance requirements typical of financial trading applications, with performance optimized for standard off-the-shelf hardware.
Key Performance Metrics
The system demonstrated exceptional performance when processing a high-load simulation for a single trading instrument:
Metric | Result | Environment
Throughput | 700,000 messages/second | Standard Core i7
Client Latency | 12µs (average) | Socket receive to book update
These metrics demonstrate a focus on efficiency in all stages of the data pipeline, from network ingress to data processing and memory management.
Architectural & Optimization Strategies
To achieve microsecond-level latency and high throughput, the architecture relied on several key low-level optimizations:
Zero Dynamic Memory Allocation
The most critical optimization was the complete elimination of dynamic memory allocations in the data's critical path. Memory management was handled via pre-allocation and custom pool management to ensure deterministic performance and avoid latency spikes caused by system memory management (e.g., malloc or new).
Custom Memory Management Abstractions
Two primary data structures were developed to manage object lifecycles efficiently without relying on standard library containers that perform allocations:
Allocation Table (FAT-inspired): An std::vector-based memory pool resembling a File Allocation Table (FAT). It uses indices rather than pointers to manage the state (allocated/free) of elements within a contiguous memory block. This provides rapid object retrieval and return while keeping memory compact.
In-Place Doubly Linked List: A custom linked list implementation where elements are linked by pointers (prev & next) and managed by the list itself, which never allocates them. It acts as a "linked list manager" for objects pulled from the allocation table, using "pre-begin" and "end" sentinel nodes for seamless insertion and removal without dynamic memory overhead.
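The two structures above can be sketched roughly as follows. This is only a minimal illustration under assumptions, not the project's actual code: the names `Order`, `AllocationTable`, `OrderList` and `kNone` are hypothetical, and the free-slot chain stored in a parallel index vector is one possible reading of "FAT-inspired".

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical order record; prev/next are the in-place list links.
struct Order {
    std::uint64_t id = 0;
    std::int64_t price = 0;
    Order* prev = nullptr;
    Order* next = nullptr;
};

constexpr std::uint32_t kNone = 0xFFFFFFFFu;  // "no slot" marker (assumption)

// FAT-like pool: next_free_[i] holds the index of the next free slot.
// Both vectors are sized once and never resized, so slots never move.
template <typename T>
class AllocationTable {
public:
    explicit AllocationTable(std::uint32_t capacity)
        : slots_(capacity), next_free_(capacity), free_head_(capacity ? 0 : kNone) {
        for (std::uint32_t i = 0; i < capacity; ++i)
            next_free_[i] = (i + 1 < capacity) ? i + 1 : kNone;
    }
    // Returns a slot index, or kNone when the pool is exhausted. No heap call.
    std::uint32_t acquire() {
        if (free_head_ == kNone) return kNone;
        const std::uint32_t i = free_head_;
        free_head_ = next_free_[i];
        return i;
    }
    // Puts slot i back at the head of the free chain.
    void release(std::uint32_t i) {
        next_free_[i] = free_head_;
        free_head_ = i;
    }
    T& operator[](std::uint32_t i) { return slots_[i]; }

private:
    std::vector<T> slots_;
    std::vector<std::uint32_t> next_free_;
    std::uint32_t free_head_;
};

// Links externally owned nodes; the two sentinels remove all edge cases
// (empty list, first element, last element) from insertion and removal.
class OrderList {
public:
    OrderList() { pre_begin_.next = &end_; end_.prev = &pre_begin_; }
    OrderList(const OrderList&) = delete;
    OrderList& operator=(const OrderList&) = delete;

    void push_back(Order& o) { insert_before(end_, o); }
    static void insert_before(Order& pos, Order& o) {
        o.prev = pos.prev;
        o.next = &pos;
        pos.prev->next = &o;
        pos.prev = &o;
    }
    static void unlink(Order& o) {
        o.prev->next = o.next;
        o.next->prev = o.prev;
        o.prev = o.next = nullptr;
    }
    Order* first() { return pre_begin_.next; }
    Order* end() { return &end_; }
    bool empty() const { return pre_begin_.next == &end_; }

private:
    Order pre_begin_, end_;  // sentinel nodes, never handed out
};
```

In this sketch an order is taken with `pool.acquire()`, linked into a price level with `push_back`, and later removed with `unlink` followed by `pool.release`; none of these steps calls malloc or new.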
Socket-Level & Buffer Optimizations
Buffer Pre-allocation: Large, persistent buffers were used for socket input/output to avoid memory allocation overhead during runtime.
Socket Tuning: Operating system-level socket options (e.g., TCP_NODELAY, receive buffer sizes) were optimized to minimize network-related latency.
Multithreading: The system architecture leverages multithreading to separate network I/O concerns from the book-reconstruction logic, maximizing CPU core utilization and preventing I/O from blocking critical processing paths.
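As a rough illustration of the socket-level points above, the sketch below applies TCP_NODELAY and a receive buffer size to a POSIX TCP socket and reads into a buffer allocated once at startup. The helper names (`tune_socket`, `nodelay_enabled`, `ReceiveBuffer`) and the buffer sizes are assumptions for the example, not the values used in the project.

```cpp
#include <cstddef>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <vector>

// Applies low-latency options to an already created TCP socket.
// Returns true only if every setsockopt call succeeded.
inline bool tune_socket(int fd, int rcvbuf_bytes) {
    int on = 1;
    // Disable Nagle's algorithm so small messages are sent immediately.
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) != 0) return false;
    // Request a larger kernel receive buffer (the OS may adjust the value).
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf_bytes, sizeof(rcvbuf_bytes)) != 0)
        return false;
    return true;
}

// Reads the option back, mainly useful for checking the configuration.
inline bool nodelay_enabled(int fd) {
    int value = 0;
    socklen_t len = sizeof(value);
    return getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &value, &len) == 0 && value != 0;
}

// Persistent input buffer: storage is allocated once, then reused for every read.
struct ReceiveBuffer {
    explicit ReceiveBuffer(std::size_t bytes) : data(bytes) {}
    // Reads whatever is available into the preallocated storage.
    ssize_t fill(int fd) { return recv(fd, data.data(), data.size(), 0); }
    std::vector<char> data;
};
```

A dedicated I/O thread would typically own the `ReceiveBuffer` and hand parsed messages to the book-reconstruction thread; that hand-off is not shown here.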
Technologies Used
Language: C++ (C++17/20 features for concurrency and std::chrono)
Libraries: Standard Template Library (STL), OS-level socket APIs (e.g., POSIX sockets)
Time Measurement: Nanosecond-resolution timestamps (std::chrono::system_clock::now()) were used for precise latency measurement.
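A latency figure such as the average above can be accumulated with a small helper like the following sketch. The `LatencyStats` type is hypothetical; it uses std::chrono::system_clock as named in the text, although the actual resolution of that clock is platform-dependent and std::chrono::steady_clock would be the usual alternative when the wall clock may be adjusted.

```cpp
#include <chrono>
#include <cstdint>

using Clock = std::chrono::system_clock;

// Accumulates per-message latencies without any allocation.
struct LatencyStats {
    std::uint64_t count = 0;
    std::int64_t total_ns = 0;
    std::int64_t max_ns = 0;

    // start: timestamp taken right after the socket receive returns
    // stop:  timestamp taken right after the book update completes
    void record(Clock::time_point start, Clock::time_point stop) {
        const std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        total_ns += ns;
        if (ns > max_ns) max_ns = ns;
        ++count;
    }

    // Average latency in microseconds (0 when nothing was recorded).
    double average_us() const { return count ? total_ns / 1000.0 / count : 0.0; }
};
```

In use, the client would call `Clock::now()` at both points of the pipeline and pass the pair to `record`.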
High-Performance C++ Quadtree Implementation (November 2025)
This project involved the design and implementation of a custom, high-performance quadtree data structure in C++. The implementation was inspired by a previous project developed in MATLAB, addressing the performance limitations inherent in object-oriented MATLAB environments by minimizing memory allocation overhead in C++.
The system was engineered to efficiently manage and query 2D spatial data (square objects) within a defined boundary, while adhering to stringent memory management requirements for deterministic performance.
Architectural & Optimization Strategies
To achieve efficiency and avoid performance bottlenecks typical of dynamic memory allocation, the architecture relied on several key low-level optimizations:
Zero Dynamic Memory Allocation: The quadtree was implemented as a single, contiguous data structure within a standard std::vector, completely eliminating a dynamically allocated node hierarchy (no separately allocated node objects). This ensures deterministic performance and avoids latency spikes caused by system memory management.
Custom Data Management: Data objects (2D squares) are stored in a single contiguous vector, managed efficiently within the flat structure.
Hierarchical Constraints: The structure enforces specific leaf node constraints based on a minimum side length, which takes precedence over the maximum object count, to maintain a balanced and performant tree.
Functionality & Querying
The implementation supports robust querying of inserted data objects:
Spatial Queries: The primary functionality allows for finding all inserted data objects that contain a given 2D coordinate point within their bounds.
Type Constraints: Queries can be performed with or without an additional data object type constraint.
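The flat layout and the point query described above could look roughly like the sketch below. It is an illustration under assumptions rather than the actual implementation: the names (`Square`, `FlatQuadtree`, `Node`, `Entry`), the choice to register a square in every leaf it overlaps, and the index-linked entry chains are all hypothetical. A real version would also call reserve() on the vectors up front so that insertions never reallocate.

```cpp
#include <vector>

struct Square { double x, y, side; int type; };  // lower-left corner, side length, user type

class FlatQuadtree {
public:
    FlatQuadtree(double x, double y, double side, double minSide, int maxPerLeaf)
        : minSide_(minSide), maxPerLeaf_(maxPerLeaf) {
        nodes_.push_back(Node{x, y, side, -1, -1, 0});  // node 0 is the root
    }

    void insert(const Square& s) {
        objects_.push_back(s);
        insertInto(0, static_cast<int>(objects_.size()) - 1);
    }

    // Indices of all inserted squares containing (px, py); type < 0 means "any type".
    std::vector<int> query(double px, double py, int type = -1) const {
        std::vector<int> hits;
        if (!inside(nodes_[0].x, nodes_[0].y, nodes_[0].side, px, py)) return hits;
        int n = 0;
        while (nodes_[n].firstChild >= 0) {  // descend to the leaf holding the point
            const double h = nodes_[n].side / 2;
            const int q = (px >= nodes_[n].x + h ? 1 : 0) + (py >= nodes_[n].y + h ? 2 : 0);
            n = nodes_[n].firstChild + q;
        }
        for (int e = nodes_[n].firstEntry; e >= 0; e = entries_[e].next) {
            const Square& s = objects_[entries_[e].object];
            if (inside(s.x, s.y, s.side, px, py) && (type < 0 || s.type == type))
                hits.push_back(entries_[e].object);
        }
        return hits;
    }

private:
    struct Node { double x, y, side; int firstChild, firstEntry, count; };
    struct Entry { int object, next; };  // singly linked by index, no pointers

    static bool inside(double x, double y, double side, double px, double py) {
        return px >= x && px <= x + side && py >= y && py <= y + side;
    }
    static bool overlaps(const Node& n, const Square& s) {
        return s.x <= n.x + n.side && s.x + s.side >= n.x &&
               s.y <= n.y + n.side && s.y + s.side >= n.y;
    }

    void insertInto(int n, int obj) {
        if (!overlaps(nodes_[n], objects_[obj])) return;
        if (nodes_[n].firstChild >= 0) {
            const int first = nodes_[n].firstChild;
            for (int q = 0; q < 4; ++q) insertInto(first + q, obj);
            return;
        }
        entries_.push_back(Entry{obj, nodes_[n].firstEntry});
        nodes_[n].firstEntry = static_cast<int>(entries_.size()) - 1;
        ++nodes_[n].count;
        // The minimum side length takes precedence over the object count:
        // an over-full leaf is only split if its children stay large enough.
        if (nodes_[n].count > maxPerLeaf_ && nodes_[n].side / 2 >= minSide_) split(n);
    }

    void split(int n) {
        const Node parent = nodes_[n];  // copy: push_back below may reallocate
        const double h = parent.side / 2;
        const int first = static_cast<int>(nodes_.size());
        for (int q = 0; q < 4; ++q)
            nodes_.push_back(Node{parent.x + ((q & 1) ? h : 0), parent.y + ((q & 2) ? h : 0),
                                  h, -1, -1, 0});
        nodes_[n].firstChild = first;
        nodes_[n].firstEntry = -1;
        nodes_[n].count = 0;
        // Re-register the parent's objects in the children (old entries are simply abandoned).
        for (int e = parent.firstEntry; e >= 0;) {
            const int next = entries_[e].next;
            const int obj = entries_[e].object;
            for (int q = 0; q < 4; ++q) insertInto(first + q, obj);
            e = next;
        }
    }

    std::vector<Node> nodes_;     // all tree nodes, children stored as 4 consecutive slots
    std::vector<Entry> entries_;  // leaf membership chains
    std::vector<Square> objects_; // the data objects themselves
    double minSide_;
    int maxPerLeaf_;
};
```

Because each square is registered in every leaf it overlaps, a point query only has to walk down to one leaf and scan its chain; the trade-off in this sketch is extra entries for squares that span several leaves.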
Technologies Used
Language: C++
Libraries: Standard Template Library (STL)
High-Throughput Verilog Crypto-Currency Hashing Chips (Feb - Jun 2022)
This project involved the design and implementation of seven high-performance crypto-currency hashing chips in Verilog 2001 on my personal time. The chips are designed to operate in large networks, featuring a common inter-chip networking infrastructure for automatic discovery and message routing. The system was engineered with redundancy and self-testing capabilities typical of robust ASIC designs.
Key Projects & Features
The following seven crypto-currency chips were developed and validated:
Bitcoin (SHA256d)
Dogecoin (Scrypt)
Siacoin (Blake2b)
Kadena (Blake2s)
Ergo (Autolykos2)
Nervos-CKB (Eaglesong)
Handshake (Handshake)
Architectural & Optimization Strategies
To enable large, fault-tolerant networks, the architecture relied on several key features:
Networking Infrastructure: Chips feature automatic neighbor discovery, DHCP-like address assignment, message routing, and automatic root node election.
Scalability & Redundancy: Networks can scale to hundreds of chips, designed to connect to four neighbors (one can be a main controller/FPGA), with multiple connection points for redundancy.
Configurable Buffers: Bitcoin, Siacoin, and Dogecoin chips allow configurable sizes for coinbase and merkle-tree buffers.
Automated Self-Test: Chips include a configurable number of redundant hashing cores and hashers, with an automated self-test upon startup that disables faulty parts and runs with the valid ones.
Technologies Used
Language: Verilog 2001
Validation: Full software simulation test implemented.
Hardware: Bitcoin, Dogecoin, Siacoin, Kadena, and CKB prototypes implemented on a Xilinx Zynq UltraScale+ ZCU104 FPGA. (Note: the Handshake chip is too large for the available FPGA; the Ergo chip is incompatible with FPGAs.)
more details for the Bitcoin chip (video demo on FPGA + video demo on software simulation)
more details for the Dogecoin chip (video demo on FPGA + video demo on software simulation)
more details for the Siacoin chip (video demo on FPGA + video demo on software simulation)
more details for the Kadena chip (video demo on FPGA + video demo on software simulation)
more details for the Ergo chip
more details for the CKB chip (video demo on FPGA + video demo on software simulation)
more details for the Handshake chip (Vivado synthesis + video demo on software simulation)
A project analysis for the manufacturing of crypto mining machines (for commercial sales & mining) can be found here 12-07-2022--Project-analysis.pdf
No source code is provided for these chips. For further information about these chips or for business ideas, please contact me here: ghazi.bousselmi@gmail.com or info@tachchouri-shop.nl
Video demo for a PCIe hashing card, featuring 45 hashing chips, and a maximum of 2kW of power: here
- HASKELL: A logical inference processor as defined in "Elementary Watson", a coding challenge on www.hackerrank.com. Still under development.
detailed results.
- HASKELL: A program compiler & interpreter for the "While language", a coding challenge on www.hackerrank.com. A more complex grammar is accepted, similar to C/C++, including variable declaration, for loops, function definition & calls, most of C operators, including post & prefix increments "++" & "--", and the ternary "? : ".
detailed results.
- C++: A class allowing arbitrary-size unsigned integer arithmetic: CUBigNumberV4.cpp - Oct-2017
- HASKELL: www.hackerrank.com challenges, rank top 0.1%:
Compile and Interpret an Intuitive language (expert): Attempt_16-06-2017.hs
Interpret a Brain-Fart program :) (advanced): Attempt_14-06-2017.hs
Compile & interpret a C-like program (expert): Attempt_13-06-2017.hs
Simplify an algebraic expression (hard): Attempt_06-06-2017.hs
Evaluate a numerical expression (expert): Attempt_06-06-2017.hs
Messy Medians (hard): Attempt_02-06-2017.hs
Tree of Life (expert): Attempt_10-04-2017.hs
- C/C++ & Algorithms: www.hackerrank.com challenges, rank top 0.2%:
Balanced Forest (Algorithms / Trees) (hard): Attempt_02_01_2021.cpp
Merge Sort: Counting Inversions (Algorithms / Sorting) (hard): Attempt_01_01_2021.cpp
Reverse Shuffle Merge (Algorithms / Greedy - Viterbi-like) (advanced): Attempt_27_12_2020.cpp
Determining DNA health (graph theory / Viterbi-Like) (hard): Attempt_23_12_2020.cpp
Chief Hopper (hard): (constant time complexity solution) Attempt_22_feb_2019.cpp
Build a String (hard): Attempt_jan_2019.cpp
Similar Pair (advanced): Attempt_jan_2019.cpp
Find (sub-)Strings (expert): Attempt_dec_2018.cpp
The Captcha Cracker (medium): Attempt_nov_2018.cpp
Cube Summation (hard): Attempt_feb-2017.cpp
Array Construction (advanced): Attempt_feb-2017.cpp
Library Query (advanced): Attempt_feb-2017.cpp
Save Humanity (expert): Attempt_jan-2017.cpp
String Similarity (expert): Attempt_jan-2017.cpp
String function calculation (advanced): Attempt_jan-2017.cpp
Minimal String Merging (expert): Attempt_jan-2017.cpp
Favourite sequence (advanced): Attempt_jan-2017.cpp
Lovely Triplets (advanced): Attempt_dec-2016.cpp
KMP Problem (hard): Attempt_dec-2016.cpp
Beautiful 3 Set (hard): Attempt_nov-2016.cpp
Inverse RMQ (hard): Attempt_nov-2016.cpp
Magic spells (hard): Attempt_nov-2016.cpp
Bit array (hard): Attempt_nov-2016.cpp
C++ Variadics (hard): Attempt_nov-2016.cpp
- A simple network simulation: physical link, network card, IP-Module, TCP-Module, full IP routing, TCP sockets (listen/ accept/ connect/ disconnect/ send/ recv).
download here (still under development). detailed results.
- A memory manager that transparently hooks all C-Runtime memory functions and redirects them to private memory management methods. Aimed at minimizing memory fragmentation and improving performance. This idea was originally developed within my work. I re-implemented a simpler version on my personal time (along with an improved object-pool).
download here (still under development). detailed results.
- A synchronization library offering the same functionality as the Windows API for HANDLEs, Events, Waitable-Timers, Thread-Creation & handling, Critical-Sections, Mutex(es), and Semaphores, with the appropriate API alternatives for creation/destruction, locking/unlocking, waiting on objects, and thread-sleep (up to microsecond precision in most cases). It was originally developed at work; I fully re-implemented it on my personal time, in a style more in line with C++11. The library can operate using its own thread, or using the lifetime of the (client) threads that call it. It offers a private implementation of the critical section (based on an Interlocked-like API) that uses a spin-wait for locking and performs 5-10 times faster than the Windows API critical section and 2-2.5 times faster than the STL's std::mutex. The HANDLE-oriented operations are based on a FIFO circular buffer that uses similar techniques, and are as a result 10 to 100 times faster than their Windows-API counterparts (WaitForSingleObject, WaitForMultipleObjects, …). It is worth noting that these operations and the circular buffer are thread-safe without using any mutex or critical section; the circular buffer itself is 2-5 times faster than a regular mutex-protected one. Beyond the speed improvement, in most usages and for the same algorithms, this library reduces CPU usage compared to the Windows API or the STL's mutex.
download here (still under development). text results. detailed results.
- A Win32/Linux HTTP-proxy, featuring several private encryption algorithms and using keys of up to 64 KB in size. The program can act as a full HTTP-proxy, or as a bridge (forwarding HTTP-proxy requests to other proxies). Thus, it can be used as an anonymizer allowing the local machine to access the web through the identity (and connection) of a remote machine, with highly encrypted HTTP stream(s) between the two.
- A compiler for Intel(r) Pentium 4(r) assembly language, written in Delphi 6 (you can download the free "Turbo Delphi 6" to run it). I included an executable file, tested with Norton antivirus (July 2008). It accepts pure assembly language to generate .COM and .EXE DOS executables, and a Pascal-like program structure (units, functions, variables, constants ...; it uses the file "system.pas", which is a sample library containing some functions such as write_wrod() ...) to generate a .EXE file.
- 3D rendering, Ray-Tracing & “Infinite Detail” (under Linux): following in the footsteps of the software developed by http://www.euclideon.com/ (infinite detail engine). The framework features a world with a maximum size of [ 2^22 x 2^22 x 2^22 ] (equivalent to a 1 kilometer cube, with a granularity of 64 pixels per cubic millimeter, each pixel having color and transparency properties), and offers a modular architecture where references to (an) object(s) can be inserted (in the world) multiple times with different position/orientation & scaling factors, as well as several types of cameras (planar, cylindrical, spherical, ...). Although the world size is 2^66, the software can accommodate several thousands of objects (or copies) of sizes in the range of 2^12 to 2^15 (cubed) with a memory load of about 1.5 Gigabytes. The drawing procedure is multi-threaded. As an example, with 8 CPU cores, 10 thousand objects (spheres / cubes) of size 2^12 cubed, and a planar camera, the framework renders 8 to 10 frames per second (at a resolution of 800x600), without any hardware GPU acceleration, only assembler code optimization (SIMD).
download here (warning: messy code ! :) ).
- Another version of the 3D Ray-Tracing, using parts of the previous one, with a style more in line with C++11 (still under development).
download here, results here.
- Mersenne-Primes' checking using hyper-parallel GPU computing (Microsoft C++AMP).
- Optimization: calculation of possible Knight's-move sequence numbers (part of a job application).
- A 3D car racing game, an attempt at a 3D version of the famous "Micromachines 2" from "Codemasters". Developed by me and Bdioui Houssem under Visual C++, within an academic project for engineering studies. The game is based on Direct3D 8.0, and features multiplayer with configurable keyboard, AI with different levels, and 2 maps. A map editor will soon be available.
- A trojan virus that allows remote control of the infected machine. Developed by me and Bdioui Houssem under Visual C++, within an academic project for engineering studies. The project is composed of two parts: a server (virus) and a client. The server allows the client to remotely control the machine: create/delete/download/upload files, kill/launch processes, control the mouse/change the cursor, control the keyboard, shut down the machine ...
- 2 mini-programs developed during my recruitment test for my current job. Please be lenient, at that time I was not familiar with the Windows API nor with MFC programming, since I had been programming under Linux for the previous 5 years (Masters and PhD).