QGTC: Accelerating Quantized GNN via GPU Tensor Core
- Cite this project and paper.
@inproceedings{QGTC,
  title={QGTC: Accelerating Quantized GNN via GPU Tensor Core},
  author={Yuke Wang and Boyuan Feng and Yufei Ding},
  booktitle={ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'22)},
  year={2022}
}
Clone this project.
git clone git@github.com:YukeWang96/PPoPP22_QGTC.git
Dependencies.
- Python 3.7+.
- PyTorch 1.5.0+.
- Deep Graph Library.
Environment Setup.
[Method-1] Install via Docker (Recommended).
- (i) Pull docker image:
docker pull happy233/qgtc:updated
docker run -it --rm --gpus all -v $PWD/:/qgtc happy233/qgtc:updated /bin/bash
- (ii) Build docker from scratch:
cd Docker/
./build.sh
./launch.sh
[Method-2] Install via Conda.
- Install conda on your system (see the official conda installation tutorial).
- Create a conda environment:
conda create -n env_name python=3.6
- Install PyTorch:
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge
or using pip (make sure the pip you use is the one from the current conda environment; you can check this with which pip):
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
- Install Deep Graph Library (DGL):
conda install -c dglteam dgl-cuda11.0
pip install torch requests
A quick environment check is sketched below.
Install QGTC.
Go to QGTC_module/, then run:
TORCH_CUDA_ARCH_LIST="8.6" python setup.py clean --all install
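Once setup.py finishes, a quick import check confirms the extension was built and installed. This is a minimal sketch; it assumes the compiled extension is importable under the name QGTC (adjust the import if setup.py registers a different module name), and it does not call any of its kernels:

```python
# Post-install sanity check (a sketch; module name is assumed).
import torch

assert torch.cuda.is_available(), "QGTC kernels require a CUDA GPU."
# TORCH_CUDA_ARCH_LIST="8.6" targets Ampere GPUs with compute capability (8, 6).
print("Compute capability:", torch.cuda.get_device_capability(0))

import QGTC  # assumed module name; raises ImportError if the build did not install correctly
print("QGTC extension imported successfully.")
```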
Running Experiments
Download datasets
./download_dataset.sh
Figure 7: Speedup comparison.
- (a) Cluster GCN. You can change the bitwidth setting in 0_7a_eval_QGTC_cluster_GCN.py to any of [1, 2, 4, 8] for evaluation; the default is 2-bit.
./0_7a_eval_QGTC_cluster_GCN.py
./1_7a_eval_DGL_cluster_GCN.py
Check the results in QGTC_cluster_GCN_*bit.csv and DGL_cluster_GCN.csv. You can expect results like the following; a short script for computing the speedup from these CSVs is sketched after the table.
| dataset | Epoch (ms) |
|---|---|
| artist | 263.646 |
| soc-BlogCatalog | 209.495 |
| ppi | 189.016 |
| ogbn-arxiv | 208.616 |
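To turn the two CSV files into the speedup numbers of Figure 7, you can join them on the dataset column. This is a minimal sketch, assuming both CSVs follow the "dataset, Epoch (ms)" layout shown above and that the QGTC file comes from the default 2-bit run; adjust the column and file names if yours differ:

```python
# compare_speedup.py -- a sketch, assuming both CSVs use the
# "dataset,Epoch (ms)" layout shown in the table above.
import pandas as pd

def compare_csv(qgtc_csv, dgl_csv):
    qgtc = pd.read_csv(qgtc_csv)
    dgl = pd.read_csv(dgl_csv)
    merged = qgtc.merge(dgl, on="dataset", suffixes=("_qgtc", "_dgl"))
    # Speedup = DGL epoch time / QGTC epoch time (higher is better for QGTC).
    merged["speedup"] = merged["Epoch (ms)_dgl"] / merged["Epoch (ms)_qgtc"]
    return merged

if __name__ == "__main__":
    print(compare_csv("QGTC_cluster_GCN_2bit.csv", "DGL_cluster_GCN.csv"))
```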
- (b) Batched GIN. You can change the bitwidth setting in 0_7b_eval_QGTC_batched_GIN.py to any of [1, 2, 4, 8] for evaluation; the default is 2-bit.
./0_7b_eval_QGTC_batched_GIN.py
./1_7b_eval_DGL_batched_GIN.py
Check the results in QGTC_batched_GIN_*bit.csv and DGL_batched_GIN.csv; the speedup helper above can be reused as shown below.
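Under the same assumption about the CSV column layout, the compare_csv() sketch from the Cluster GCN section applies directly to the batched GIN results:

```python
# Reuse the hedged compare_csv() sketch defined above.
print(compare_csv("QGTC_batched_GIN_2bit.csv", "DGL_batched_GIN.csv"))
```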
Figure 8: Additional studies.
- (a) Comparison with cuBLAS GemmEx (INT8) on Tensor Cores.
./3_8a_QGTC_GEMM_INT8.py
cd cuBLASGemmEX/
./compile.sh
./bench_cuBLAS_INT8.py
Running ./3_8a_QGTC_GEMM_INT8.py, you will get results like this:
======== 1 bit ==================
X1_height 1024, X1_width: 1024, X2_width: 16, TFLOPs: 5.847
X1_height 2048, X1_width: 2048, X2_width: 16, TFLOPs: 16.605
X1_height 4096, X1_width: 4096, X2_width: 16, TFLOPs: 40.627
X1_height 1024, X1_width: 1024, X2_width: 32, TFLOPs: 11.724
X1_height 2048, X1_width: 2048, X2_width: 32, TFLOPs: 32.666
X1_height 4096, X1_width: 4096, X2_width: 32, TFLOPs: 35.032
X1_height 1024, X1_width: 1024, X2_width: 64, TFLOPs: 23.219
X1_height 2048, X1_width: 2048, X2_width: 64, TFLOPs: 37.438
X1_height 4096, X1_width: 4096, X2_width: 64, TFLOPs: 46.768
======== 2 bit ==================
X1_height 1024, X1_width: 1024, X2_width: 16, TFLOPs: 3.934
X1_height 2048, X1_width: 2048, X2_width: 16, TFLOPs: 10.086
X1_height 4096, X1_width: 4096, X2_width: 16, TFLOPs: 20.764
X1_height 1024, X1_width: 1024, X2_width: 32, TFLOPs: 7.864
X1_height 2048, X1_width: 2048, X2_width: 32, TFLOPs: 19.762
X1_height 4096, X1_width: 4096, X2_width: 32, TFLOPs: 20.951
X1_height 1024, X1_width: 1024, X2_width: 64, TFLOPs: 15.429
X1_height 2048, X1_width: 2048, X2_width: 64, TFLOPs: 25.055
X1_height 4096, X1_width: 4096, X2_width: 64, TFLOPs: 26.818
======== 4 bit ==================
X1_height 1024, X1_width: 1024, X2_width: 16, TFLOPs: 2.488
X1_height 2048, X1_width: 2048, X2_width: 16, TFLOPs: 6.561
X1_height 4096, X1_width: 4096, X2_width: 16, TFLOPs: 12.409
X1_height 1024, X1_width: 1024, X2_width: 32, TFLOPs: 4.456
X1_height 2048, X1_width: 2048, X2_width: 32, TFLOPs: 12.807
X1_height 4096, X1_width: 4096, X2_width: 32, TFLOPs: 13.929
X1_height 1024, X1_width: 1024, X2_width: 64, TFLOPs: 10.683
X1_height 2048, X1_width: 2048, X2_width: 64, TFLOPs: 12.328
X1_height 4096, X1_width: 4096, X2_width: 64, TFLOPs: 14.196
======== 8 bit ==================
X1_height 1024, X1_width: 1024, X2_width: 16, TFLOPs: 1.541
X1_height 2048, X1_width: 2048, X2_width: 16, TFLOPs: 3.483
X1_height 4096, X1_width: 4096, X2_width: 16, TFLOPs: 6.763
X1_height 1024, X1_width: 1024, X2_width: 32, TFLOPs: 3.074
X1_height 2048, X1_width: 2048, X2_width: 32, TFLOPs: 6.816
X1_height 4096, X1_width: 4096, X2_width: 32, TFLOPs: 7.366
X1_height 1024, X1_width: 1024, X2_width: 64, TFLOPs: 5.046
X1_height 2048, X1_width: 2048, X2_width: 64, TFLOPs: 6.165
X1_height 4096, X1_width: 4096, X2_width: 64, TFLOPs: 7.324
Running ./bench_cuBLAS_INT8.py, you will get results like this (the TFLOPS accounting is sketched after the listing):
M: 1024, K: 1024, N: 16, TFLOPS: 0.55
M: 2048, K: 2048, N: 16, TFLOPS: 2.58
M: 4096, K: 4096, N: 16, TFLOPS: 3.60
M: 1024, K: 1024, N: 32, TFLOPS: 3.89
M: 2048, K: 2048, N: 32, TFLOPS: 5.49
M: 4096, K: 4096, N: 32, TFLOPS: 6.49
M: 1024, K: 1024, N: 64, TFLOPS: 4.38
M: 2048, K: 2048, N: 64, TFLOPS: 6.30
M: 4096, K: 4096, N: 64, TFLOPS: 6.65
- (b) Zero-tile jumping efficiency.
./3_8_zero_tile_jumping.py
Check the results in zerotile_jumping.csv.
- (c) Adjacency matrix size impact.
./3_9_adjmatrix_size.py
You will get results like this (a small parsing sketch follows the listing):
X1_height 1024, X1_width: 1024, X2_width: 16, TFLOPs: 5.831
X1_height 2048, X1_width: 2048, X2_width: 16, TFLOPs: 16.323
X1_height 4096, X1_width: 4096, X2_width: 16, TFLOPs: 34.425
X1_height 1024, X1_width: 1024, X2_width: 32, TFLOPs: 11.717
X1_height 2048, X1_width: 2048, X2_width: 32, TFLOPs: 32.027
X1_height 4096, X1_width: 4096, X2_width: 32, TFLOPs: 40.175
X1_height 1024, X1_width: 1024, X2_width: 64, TFLOPs: 23.158
X1_height 2048, X1_width: 2048, X2_width: 64, TFLOPs: 37.444
X1_height 4096, X1_width: 4096, X2_width: 64, TFLOPs: 46.759
X1_height 1024, X1_width: 1024, X2_width: 128, TFLOPs: 28.417
X1_height 2048, X1_width: 2048, X2_width: 128, TFLOPs: 40.646
X1_height 4096, X1_width: 4096, X2_width: 128, TFLOPs: 52.517
X1_height 1024, X1_width: 1024, X2_width: 256, TFLOPs: 32.089
X1_height 2048, X1_width: 2048, X2_width: 256, TFLOPs: 44.151
X1_height 4096, X1_width: 4096, X2_width: 256, TFLOPs: 59.508
X1_height 1024, X1_width: 1024, X2_width: 512, TFLOPs: 41.743
X1_height 2048, X1_width: 2048, X2_width: 512, TFLOPs: 49.687
X1_height 4096, X1_width: 4096, X2_width: 512, TFLOPs: 64.172
X1_height 1024, X1_width: 1024, X2_width: 1024, TFLOPs: 37.954
X1_height 2048, X1_width: 2048, X2_width: 1024, TFLOPs: 52.970
X1_height 4096, X1_width: 4096, X2_width: 1024, TFLOPs: 66.490
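If you want to tabulate how throughput scales with the adjacency matrix size, the printed lines can be parsed directly. A minimal sketch, assuming you redirected the script's output to a file named adjmatrix_size.log (a hypothetical filename):

```python
# Parse the "X1_height ..., TFLOPs: ..." lines into (height, width, X2_width, TFLOPs)
# tuples to inspect the scaling trend. The log filename is a placeholder.
import re

pattern = re.compile(
    r"X1_height (\d+), X1_width: (\d+), X2_width: (\d+), TFLOPs: ([\d.]+)")

with open("adjmatrix_size.log") as f:
    rows = [tuple(map(float, m.groups())) for m in map(pattern.search, f) if m]

for h, w, x2, tflops in rows:
    print(f"{int(h):>5} x {int(w):<5}  X2_width={int(x2):<5}  {tflops:6.2f} TFLOPs")
```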