add logo for flag-attention & link to flag-open
iclementine committed Feb 4, 2024
1 parent 548382d commit a4e7b02
Showing 9 changed files with 24 additions and 5 deletions.
18 changes: 14 additions & 4 deletions README.md
@@ -1,5 +1,9 @@
# FlagAttention

<p align="center">
<img src="./assets/logo/horizontal-blue.png" width = "400" alt="flag-attention" >
</p>

[中文版](./README_cn.md)

FlagAttention is a project for memory-efficient attention operators implemented in the [Triton language](https://github.com/openai/triton). Motivated by the need for non-standard attention operators in language modeling, it started as an extension of multi-head attention.
@@ -75,7 +79,7 @@ FlagAttention provides customized operators for attention. When an operator is e

A recent version of `pytest` (>= 7.1.0) is required to run the tests in `tests/`. Operators in `FlagAttention` are tested against [reference implementations](src/flag_attn/testing) in PyTorch provided by `flag_attn.testing`, for both the forward and backward passes. For operators that support `float16` or `bfloat16` inputs, three different implementations are included for numerical accuracy testing, as sketched after the list below.

1. **Reference Implementation in PyTorch**: This implementation upcasts the inputs to `float32` and performs all computations in `float32` before casting the outputs back to `float16` or `bfloat16`.
2. **Triton Implementation**: The Triton implementation uses `float16` or `bfloat16` for MMA (matrix multiplication accumulation) inputs and `float32` for MMA outputs and other computations.
3. **PyTorch Implementation**: This implementation mirrors the computations in the reference implementation, except that its precision matches that of the Triton implementation.
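
Below is a minimal PyTorch sketch of the precision scheme used by the reference implementation (item 1); the function name and the causal-mask handling are illustrative assumptions, not the actual `flag_attn.testing` code.

```python
import torch

def attention_reference_fp32(q, k, v, causal=False):
    """Upcast to float32, compute everything in float32, cast the output back."""
    orig_dtype = q.dtype
    q, k, v = q.float(), k.float(), v.float()
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale            # (..., seqlen_q, seqlen_k)
    if causal:
        seqlen_q, seqlen_k = scores.shape[-2:]
        mask = torch.ones(seqlen_q, seqlen_k, device=q.device).tril().bool()
        scores = scores.masked_fill(~mask, float("-inf"))
    p = torch.softmax(scores, dim=-1)                     # attention weights in float32
    return (p @ v).to(orig_dtype)
```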

@@ -114,9 +118,9 @@ In addition to the attention outputs, it can return some extra outputs dependes

### piecewise_attention

The first extension to FlashAttention is [piecewise_attention](src/flag_attn/piecewise.py). This operator enhances FlashAttention by using two `q`'s and two `k`'s to calculate the attention scores (S) before applying softmax to obtain the attention weights (P).

The rationale behind this design is rooted in the observation that a transformer with rotary position embedding struggles to predict sequences longer than the maximum sequence length it was trained on. Pairs of `(q, k)` yield unexpectedly high attention scores when their distance exceeds the maximum sequence length in the training set.

To address this issue, BAAI proposes NLPE (Non-Linearized Position Embedding), which applies two different position embeddings to `q` and `k` depending on whether the distance between `q` and `k` exceeds a pre-defined threshold, producing `q1, q2` and `k1, k2`. The attention score is then computed as the dot product of `q1, k1` or `q2, k2`, depending on that distance.
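
A rough PyTorch sketch of this selection rule is shown below; it is for clarity only, not the fused Triton kernel, and the parameter name `dist_threshold` and the exact distance definition are assumptions.

```python
import torch

def piecewise_scores(q1, k1, q2, k2, dist_threshold):
    # q1, q2: (seqlen_q, d); k1, k2: (seqlen_k, d)
    s1 = q1 @ k1.t()                                                 # scores for nearby (q, k) pairs
    s2 = q2 @ k2.t()                                                 # scores for distant (q, k) pairs
    qi = torch.arange(q1.shape[0], device=q1.device).unsqueeze(1)    # query positions
    kj = torch.arange(k1.shape[0], device=k1.device).unsqueeze(0)    # key positions
    far = (qi - kj).abs() > dist_threshold                           # does the distance exceed the threshold?
    return torch.where(far, s2, s1)                                  # softmax is then applied to these scores
```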

@@ -214,7 +218,8 @@ The performance of piecewise_attention has improved compared to that in v0.1. In
- support causal and non-causal modes;
- support forward & backward modes;
- the sequence length of k/v can be different from that of q;
- support computing the total attention that each `k` gets from all `q`'s (see the sketch after this list);
- support returning the accumulative attention of each key.
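
A conceptual sketch of the "total attention" output mentioned above (non-causal, unbatched case for brevity; the real operator accumulates this inside the fused Triton kernel rather than materializing `P`):

```python
import torch

def total_attention_per_key(q, k):
    # P: (seqlen_q, seqlen_k) attention weights; the total attention of key j is sum_i P[i, j]
    scale = q.shape[-1] ** -0.5
    p = torch.softmax((q @ k.t()) * scale, dim=-1)
    return p.sum(dim=0)   # (seqlen_k,): how much attention each key receives from all queries
```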

#### Limitations

@@ -227,3 +232,8 @@ The performance of piecewise_attention has improved compared to that in v0.1. In
2. Test on more versions of triton;
3. Improve performance of the attention operators (especially the backward op);
4. Support other extensions to flash attention.

## More

For more about BAAI's open-source system for large models, please visit [BAAI/FlagOpen](https://flagopen.baai.ac.cn/).
[<img src="./assets/logo/baai-flagopen.jpeg">](https://flagopen.baai.ac.cn/)
11 changes: 10 additions & 1 deletion README_cn.md
@@ -1,5 +1,9 @@
# FlagAttention

<p align="center">
<img src="./assets/logo/horizontal-blue.png" width = "400" alt="flag-attention" >
</p>

[English](./README.md)


@@ -66,7 +70,7 @@ pip install dist/flag_attn-xxx.whl

## Usage

FlagAttention provides customized attention operators. When an operator is functionally equivalent to a torch function, it can be used as a drop-in replacement for the corresponding torch function.
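
A minimal usage sketch, assuming the flash-attention-style operator is exposed as `flag_attn.flash_attention` and takes tensors laid out as `(batch, heads, seqlen, headdim)`; the exact API may differ.

```python
import torch
import flag_attn  # assumes the wheel built above is installed

B, H, T, D = 2, 8, 1024, 64
q = torch.randn(B, H, T, D, dtype=torch.float16, device="cuda", requires_grad=True)
k = torch.randn(B, H, T, D, dtype=torch.float16, device="cuda", requires_grad=True)
v = torch.randn(B, H, T, D, dtype=torch.float16, device="cuda", requires_grad=True)

o = flag_attn.flash_attention(q, k, v, causal=True)  # used in place of the equivalent torch attention
o.sum().backward()                                   # the backward pass is supported as well
print(q.grad.shape)
```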

## Running the Tests

@@ -220,3 +224,8 @@ print(gq)
2. Test on more versions of Triton;
3. Improve the performance of the operators;
4. Support other extensions to FlashAttention.

## More

For more open-source large-model technology from BAAI (Beijing Academy of Artificial Intelligence), please visit [BAAI/FlagOpen](https://flagopen.baai.ac.cn/).
[<img src="./assets/logo/baai-flagopen.jpeg">](https://flagopen.baai.ac.cn/)
Binary file added assets/logo/baai-flagopen.jpeg
Binary file added assets/logo/horizontal-blue.png
Binary file added assets/logo/horizontal-dark.png
Binary file added assets/logo/horizontal-light.png
Binary file added assets/logo/vertical-blue.png
Binary file added assets/logo/vertical-dark.png
Binary file added assets/logo/vertical-light.png
