CHANG-NING TSAI crazyguitar

// CUDA-C includes
#include <cuda.h>
#include <stdio.h>
#include "../common/book.h"
extern "C"
int runCudaPart();
// CODE goes below
@crazyguitar
crazyguitar / docs.md
Created February 8, 2026 06:53 — forked from rbiswasfc/docs.md
SGLang 0.4.3

File: README.md

SGLang Documentation

We recommend that new contributors start by writing documentation, which helps you quickly understand the SGLang codebase. Most documentation files are located under the docs/ folder. We prefer Jupyter Notebooks over Markdown so that all examples can be executed and validated by our docs CI pipeline.

Docs Workflow

Install Dependency

crazyguitar / gist:4457149d58218c75c83c9a792bcf938f
Created December 6, 2025 07:22 — forked from gcmurphy/gist:6217834
Templates for loop unrolling / metaprogramming.
#include <iostream>
#include <cstdint>
template<int i, uint64_t val, typename Function>
class Loop {
public:
static inline void call(Function f){
f(i, val);
Loop<i-1, val, Function>::call(f);
}
crazyguitar / memory_check.cpp
Created December 6, 2025 06:51 — forked from thirdwing/memory_check.cpp
C++ code to print out runtime memory usage
#include <iostream>
#include <fstream>
#include <unistd.h>
void process_mem_usage(double& vm_usage, double& resident_set)
{
vm_usage = 0.0;
resident_set = 0.0;
// the two fields we want
crazyguitar / MoE.py
Created October 4, 2025 04:00 — forked from ruvnet/MoE.py
A PyTorch implementation of a Mixture of Experts (MoE) model resembling the Mixtral 8x7B architecture, with detailed inline comments. This model combines transformer layers with an MoE layer consisting of 8 experts, aiming for high efficiency by activating only 2 experts per token. It's configured with dimensions reflecting the operational effic…
"""
This model integrates the MoE concept within a Transformer architecture. Each token's
representation is processed by a subset of experts, determined by the gating mechanism.
This architecture allows for efficient and specialized handling of different aspects of the
data, aiming for the adaptability and efficiency noted in the Mixtral 8x7B model's design
philosophy. The model activates only a fraction of the available experts for each token,
significantly reducing the computational resources needed compared to activating all experts
for all tokens.
"""
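The routing idea described in the docstring above (each token is sent to only 2 of 8 experts, chosen by a gate) can be sketched in plain Python. This is a minimal illustration and not the gist's PyTorch code: the helper names `top_k_route`, `moe_forward`, and `make_expert` are hypothetical, the experts are toy scaling functions instead of feed-forward networks, and the gate is a random linear projection. It assumes the common formulation in which a softmax is taken over the logits of the selected experts only.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k=2):
    """Pick the k experts with the largest gate logits.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([gate_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def make_expert(scale):
    """Toy expert (assumption): scales every feature by a constant."""
    return lambda token: [scale * x for x in token]


def moe_forward(token, experts, gate_weights, k=2):
    """Route one token through its top-k experts and mix their outputs."""
    # One gate logit per expert: dot product of the token with that expert's gate row.
    gate_logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    routes = top_k_route(gate_logits, k)
    output = [0.0] * len(token)
    for expert_index, weight in routes:
        # Only the selected experts are evaluated; the rest are skipped.
        expert_out = experts[expert_index](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, routes


NUM_EXPERTS, DIM = 8, 4
rng = random.Random(0)
experts = [make_expert(i + 1) for i in range(NUM_EXPERTS)]
gate_weights = [[rng.uniform(-1, 1) for _ in range(DIM)]
                for _ in range(NUM_EXPERTS)]
token = [0.5, -1.0, 0.25, 2.0]

output, routes = moe_forward(token, experts, gate_weights)
print("selected experts:", [index for index, _ in routes])
```

Because only `k` expert functions are called per token, the per-token compute scales with `k` and not with the total number of experts, which is the efficiency argument the docstring makes.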
crazyguitar / coro.cpp
Created January 8, 2025 07:23 — forked from Qix-/coro.cpp
C++20 coroutines + LibUV sample, v2
// Thank you to the folks at the C++ slack channel,
// along with @lewissbaker for the excellent literature
// (even though it took me a few days to be convinced
// it really was so).
#include <uv.h>
#include <iostream>
#include <experimental/coroutine>
crazyguitar / dag.py
Created December 25, 2024 00:13 — forked from OhadRubin/dag.py
import networkx as nx
from itertools import product
"""
When we compare this code with Airflow, the strengths of your code lie in its simplicity, lightweight nature, and the ability to easily integrate with existing Python code:
Simplicity: This code provides a simple and straightforward way to model and work with DAGs without needing to go through the process of setting up and configuring a comprehensive system like Airflow. For smaller teams or projects with less complexity, this can be an advantage.
Lightweight and easy to incorporate: Your code is a compact, single-file solution that can be easily integrated into an existing Python project without having to set up an entire Airflow environment. When your primary focus is on creating task dependencies with parameter combinations, rather than scheduling and monitoring, your code is easier to incorporate.
Focused on task generation: Your code emphasizes creating a Cartesian product of tasks associated with nodes' parameters. It is geared towards tackling
crazyguitar / nsight.sh
Created October 4, 2024 18:23 — forked from mcarilli/nsight.sh
Favorite Nsight Systems profiling commands for PyTorch scripts
# This isn't supposed to run as a bash script; I named it with ".sh" for syntax highlighting.
# https://developer.nvidia.com/nsight-systems
# https://docs.nvidia.com/nsight-systems/profiling/index.html
# My preferred nsys (command line executable used to create profiles) commands
#
# In your script, write
# torch.cuda.nvtx.range_push("region name")
# ...
crazyguitar / bench.py
Created June 15, 2024 02:06 — forked from marians/bench.py
Benchmarking serialization/deserialization in Python using json, pickle and cPickle
import cPickle
import pickle
import json
import random
from time import time
from hashlib import md5
test_runs = 1000
def float_list():
crazyguitar / commands.md
Created June 11, 2024 20:13 — forked from mcarilli/commands.md
Single- and multiprocess profiling workflow with nvprof and NVVP (Nsight Systems coming soon...)

Ordinary launch commands (no profiling):

Single-process:

python main_amp.py -a resnet50 --b 224 --deterministic --workers 4 --opt-level O1 ./bare_metal_train_val/

Multi-process:

python -m torch.distributed.launch --nproc_per_node=2 main_amp.py -a resnet50 --b 224 --deterministic --workers 4 --opt-level O1 ./bare_metal_train_val/