RoFormer: Enhanced Transformer with Rotary Position Embedding

Jianlin Su (bojonesu@wezhuiyi.com), Yu Lu (julianlu@wezhuiyi.com), Shengfeng Pan (nickpan@wezhuiyi.com), Bo Wen (brucewen@wezhuiyi.com), Yunfeng Liu (glenliu@wezhuiyi.com)
Zhuiyi Technology Co., Ltd., Shenzhen

A PREPRINT - April 2021

Abstract

Position encoding in the transformer architecture provides supervision for dependency modeling between elements at different positions in a sequence. We investigate various methods of encoding positional information in transformer-based language models and propose a novel implementation named Rotary Position Embedding (RoPE). The proposed RoPE encodes absolute positional information with a rotation matrix and naturally incorporates explicit relative position dependency in the self-attention formulation. Notably, RoPE comes with valuable properties such as the flexibility to expand to any sequence length, decaying inter-token dependency with increasing relative distance, and the capability of equipping linear self-attention with relative position encoding. As a result, the enhanced transformer with rotary position embedding, or RoFormer, achieves superior performance on tasks with long texts. We release the theoretical analysis along with some preliminary experimental results on Chinese data. The ongoing experiments on English benchmarks will be updated soon.



1 Introduction

The sequential order of words plays a vital role in natural language. Recurrent neural networks (RNNs) encode the order of tokens by recursively computing a hidden state along the time dimension. Convolutional neural networks (CNNs) [5] were typically considered position-agnostic, but recent work [9] has shown that the commonly used padding operation can implicitly learn position information. In recent years, the effectiveness of transformer-based models has been shown on various natural language processing (NLP) tasks such as context representation learning [4], machine translation [21], and language modeling [16]. Unlike recurrent-based and convolution-based models, transformer-based models utilize the self-attention architecture to capture dependencies among tokens in the context, which provides better parallelization than RNNs and can model longer intra-token relations than CNNs (a stack of multiple CNN layers can also capture longer intra-token relations; here we only consider the single-layer setting). Since transformer-based models contain no recurrence and no convolution, and the self-attention architecture has been shown to be position-agnostic [26], different approaches have been proposed to inject position information into the model. One line of work focuses on absolute position encoding, where absolute position encodings, either trainable [5, 4, 12, 2, 16, 15] or generated by a pre-defined function [21], are added to the context representations. The other line of work [14, 18, 7, 3, 25, 17, 11, 6, 8] focuses on relative position encoding, which typically injects relative position information into the attention calculation. In addition to these approaches, [13] has proposed to model the dependency of position encoding from the perspective of Neural ODEs [1], and [22] has proposed to model position information in complex space.

In this work, we first establish a formal description of the position encoding problem in the self-attention architecture and revisit previous works in Section 2. We then propose the rotary position embedding (RoPE) and study its properties in Section 3. We report preliminary experiments in Section 4. Our contributions are as follows:

• We investigate previous works on relative position encoding and find that most of them are based on the decomposition of adding position encodings to the context representations. We instead propose to encode the relative position by multiplying the context representations with a rotation matrix, which has a clear theoretical interpretation.

• We study the properties of RoPE and show that the inter-token dependency decays as the relative distance increases, which is desired for natural language encoding. We argue that previous relative position encoding approaches are not compatible with linear self-attention and show that RoPE can be used in such mechanisms.

• We demonstrate that RoFormer achieves superior performance compared with peer models on tasks dealing with long texts. Preliminary experiments with a pre-trained Chinese RoFormer² are carried out on downstream tasks. Benchmarks on English datasets are in progress and will be released once finished.

2 Background and Related Work

2.1 Preliminary

Let $S_N = \{w_i\}_{i=1}^N$ be a sequence of $N$ input tokens with $w_i$ being the $i$-th element. The corresponding word embeddings of $S_N$ are denoted as $E_N = \{x_i\}_{i=1}^N$, where $x_i \in \mathbb{R}^d$ is the $d$-dimensional word embedding vector of token $w_i$ without position information. Self-attention first incorporates position information into the word embeddings and transforms them into query, key, and value representations:

$$q_m = f_q(x_m, m), \quad k_n = f_k(x_n, n), \quad v_n = f_v(x_n, n), \tag{1}$$

where $q_m$, $k_n$, and $v_n$ incorporate the $m$-th and $n$-th positions through the functions $f_q$, $f_k$, and $f_v$, respectively. The attention weights are then calculated from the query and key representations, and the output is computed as the weighted sum of the value representations:

$$a_{m,n} = \frac{\exp(q_m^\top k_n/\sqrt{d})}{\sum_{j=1}^N \exp(q_m^\top k_j/\sqrt{d})}, \qquad o_m = \sum_{n=1}^N a_{m,n} v_n. \tag{2}$$
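To make eqs. (1)-(2) concrete, the following is a minimal NumPy sketch of the self-attention computation, assuming for the moment that $f_q$, $f_k$, $f_v$ ignore their position argument (i.e. $f(x_i, i) = W x_i$, no position information yet); all names and shapes are purely illustrative.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Minimal sketch of eqs. (1)-(2); here f_q, f_k, f_v drop the position
    argument, i.e. f(x_i, i) = W x_i (no position information yet)."""
    d = W_q.shape[0]
    Q = X @ W_q.T                               # q_m = f_q(x_m, m)
    K = X @ W_k.T                               # k_n = f_k(x_n, n)
    V = X @ W_v.T                               # v_n = f_v(x_n, n)
    scores = Q @ K.T / np.sqrt(d)               # q_m^T k_n / sqrt(d)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)       # a_{m,n}, eq. (2)
    return A @ V                                # o_m = sum_n a_{m,n} v_n

rng = np.random.default_rng(0)
N, d = 6, 8
X = rng.normal(size=(N, d))                     # word embeddings without position info
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)   # (6, 8)
```

The remainder of this section and Section 3 are precisely about better choices of $f_q$, $f_k$, and $f_v$.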

The research on position encoding of the transformer mainly focuses on choosing suitable forms for the functions in eq. (1).

2.2 Absolute position embedding

A typical choice of eq. (1) is

$$f_{t: t\in\{q,k,v\}}(x_i, i) := W_{t: t\in\{q,k,v\}}(x_i + p_i), \tag{3}$$

where $p_i \in \mathbb{R}^d$ is a $d$-dimensional vector depending on the position of token $x_i$. [4, 12, 2, 16, 15] used a set of trainable vectors $p_i \in \{p_t\}_{t=1}^L$, where $L$ is the maximum sequence length. On the other hand, [21] has proposed to generate $p_i$ using the sinusoidal function:

$$p_{i,2t} = \sin\!\big(i/10000^{2t/d}\big), \qquad p_{i,2t+1} = \cos\!\big(i/10000^{2t/d}\big), \tag{4}$$

where $p_{i,2t}$ is the $2t$-th element of the $d$-dimensional vector $p_i$. In Section 3, we will show that our proposed RoPE is related to this approach from the perspective of using the sinusoidal function, but ours incorporates relative position information by multiplying the sinusoidal functions with the context representation instead of adding them.
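As an illustration, here is a small NumPy sketch of the sinusoidal embedding of eq. (4); the interleaved sin/cos layout is one common convention and is assumed rather than prescribed by the text.

```python
import numpy as np

def sinusoidal_position_embedding(num_positions, d):
    """Sketch of eq. (4): p_{i,2t} = sin(i / 10000^{2t/d}),
    p_{i,2t+1} = cos(i / 10000^{2t/d})."""
    positions = np.arange(num_positions)[:, None]        # position index i
    freqs = 1.0 / 10000 ** (2 * np.arange(d // 2) / d)   # 1 / 10000^{2t/d}
    angles = positions * freqs                           # shape (num_positions, d/2)
    P = np.zeros((num_positions, d))
    P[:, 0::2] = np.sin(angles)                          # even entries
    P[:, 1::2] = np.cos(angles)                          # odd entries
    return P

P = sinusoidal_position_embedding(128, 64)
# absolute position embedding in the sense of eq. (3): add p_i to the token embedding x_i
print(P.shape)  # (128, 64)
```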

² The code and the pre-trained Chinese model are available at https://github.com/ZhuiyiTechnology/roformer

2.3 Relative position embedding

[18] used a different setting of eq. (1), as follows:

$$f_q(x_m) := W_q x_m, \qquad f_k(x_n, n) := W_k(x_n + \tilde{p}^k_r), \qquad f_v(x_n, n) := W_v(x_n + \tilde{p}^v_r), \tag{5}$$

where $\tilde{p}^k_r, \tilde{p}^v_r \in \mathbb{R}^d$ are trainable relative position embeddings. Note that $r = \mathrm{clip}(m - n, r_{\min}, r_{\max})$ represents the relative distance between positions $m$ and $n$. They clipped the relative distance with the hypothesis that precise relative position information is not useful beyond a certain distance.
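A hypothetical NumPy sketch of the clipped relative lookup of eq. (5) for the key side is shown below; the clipping range `r_max` and the lookup-table layout are assumptions for illustration only.

```python
import numpy as np

def relative_key_embeddings(N, d, r_max=4, seed=0):
    """Sketch of the key side of eq. (5): look up a trainable vector p~_r with
    r = clip(m - n, -r_max, r_max); r_max is an assumed hyperparameter."""
    rng = np.random.default_rng(seed)
    table = rng.normal(size=(2 * r_max + 1, d))        # trainable p~ vectors
    m = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    r = np.clip(m - n, -r_max, r_max) + r_max          # shift indices to [0, 2*r_max]
    return table[r]                                    # (N, N, d); added to the key x_n for each query m

rel_k = relative_key_embeddings(N=6, d=8)
print(rel_k.shape)  # (6, 6, 8)
```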

Keeping the form of eq. (3), [3] has proposed to decompose the $q_m^\top k_n$ term in eq. (2) as

$$q_m^\top k_n = x_m^\top W_q^\top W_k x_n + x_m^\top W_q^\top W_k p_n + p_m^\top W_q^\top W_k x_n + p_m^\top W_q^\top W_k p_n. \tag{6}$$

They replaced the absolute position embedding $p_n$ with its sinusoid-encoded relative counterpart $\tilde{p}_{m-n}$ and replaced the absolute position $p_m$ in the third and fourth terms with two trainable vectors $u$, $v$ independent of the query positions. Further, $W_k$ is distinguished for the content-based and location-based key vectors $x_n$ and $p_n$, denoted as $W_k$ and $\widetilde{W}_k$, resulting in:

$$q_m^\top k_n = x_m^\top W_q^\top W_k x_n + x_m^\top W_q^\top \widetilde{W}_k \tilde{p}_{m-n} + u^\top W_q^\top W_k x_n + v^\top W_q^\top \widetilde{W}_k \tilde{p}_{m-n}. \tag{7}$$

It is worth mentioning that they removed the position information in the value term by setting $f_v(x_j) := W_v x_j$. Later works [17, 6, 11, 8] followed this step by only considering injecting relative position information into the attention weights. [17] revised eq. (6) as

$$q_m^\top k_n = x_m^\top W_q^\top W_k x_n + b_{i,j}, \tag{8}$$

where $b_{i,j}$ is a trainable bias. [11] investigated the middle two terms of eq. (6) and found little correlation between absolute positions and words. Following [17], they proposed to model a pair of words or positions using different projection matrices:

$$q_m^\top k_n = x_m^\top W_q^\top W_k x_n + p_m^\top U_q^\top U_k p_n + b_{i,j}. \tag{9}$$

[6] argued that the relative position of a word pair can only be fully modeled by using both of the middle two terms of eq. (6), so they proposed to replace the absolute position embeddings $p_m$ and $p_n$ in these two terms with relative position embeddings $\tilde{p}_{m-n}$:

$$q_m^\top k_n = x_m^\top W_q^\top W_k x_n + x_m^\top W_q^\top W_k \tilde{p}_{m-n} + \tilde{p}_{m-n}^\top W_q^\top W_k x_n. \tag{10}$$

[15] has compared four variants of relative position embeddings and shown that the variant similar to eq. (10) is the most efficient of the four.

All these works modified eq. (6) based on the decomposition of eq. (3) under the self-attention setting of eq. (2), which originally comes from [21]. They share the same nature: the position information is injected by deliberately adding it to the context representations. Different from these works, our approach aims to derive the relative position encoding directly from eq. (1) under some constraints. In Section 3, we show that the derived approach is more interpretable, as it incorporates relative position information through the rotation of context representations.
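As a representative of this additive family, the sketch below adds a trainable bias $b_{i,j}$ indexed by the clipped relative distance to the attention logits, in the spirit of eq. (8); using clipping rather than the bucketing actually used by [17] is a simplification assumed for illustration.

```python
import numpy as np

def attention_logits_with_relative_bias(Q, K, bias_table, r_max=8):
    """Sketch in the spirit of eq. (8): q_m^T k_n + b_{i,j}, where the trainable
    bias is looked up by the clipped relative distance (a simplification)."""
    N, d = Q.shape
    logits = Q @ K.T / np.sqrt(d)
    r = np.clip(np.arange(N)[:, None] - np.arange(N)[None, :], -r_max, r_max) + r_max
    return logits + bias_table[r]          # b_{i,j} depends only on m - n

rng = np.random.default_rng(0)
N, d, r_max = 6, 8, 8
Q, K = rng.normal(size=(N, d)), rng.normal(size=(N, d))
bias_table = rng.normal(size=(2 * r_max + 1,))   # trainable relative biases
print(attention_logits_with_relative_bias(Q, K, bias_table).shape)  # (6, 6)
```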

3 Proposed Approach

In this section, we discuss the proposed rotary position embedding (RoPE). We first formulate the relative position encoding problem in Section 3.1, then derive RoPE in Section 3.2, and investigate its properties in Section 3.3.

3.1 Formulation

Language modeling in a transformer integrates the position information of individual tokens through self-attention. We start from eq. (1) and notice that the $q_m^\top k_n$ term in eq. (2) is what facilitates information exchange between tokens at different positions. In order to incorporate relative position information, we require the inner product of the query $q_m$ and the key $k_n$ to be formulated by a function $g$ that takes only the word embeddings $x_m$, $x_n$ and their relative position $m - n$ as input variables. In other words, we want the inner product to encode position information only in the relative form:

$$\langle f_q(x_m, m), f_k(x_n, n)\rangle = g(x_m, x_n, m - n). \tag{11}$$

Finding such an encoding mechanism is then equivalent to solving for functions $f_q(x_m, m)$ and $f_k(x_n, n)$ that conform to the above relation.

3.2 Rotary Position Embedding

3.2.1 A 2D case

We start from the simple case with dimension $d = 2$. Under this setting, we make use of the geometric property of vectors on the 2D plane and their complex form to prove (refer to Appendix A for more details) that a solution to our formulation eq. (11) is:

$$f_q(x_m, m) = (W_q x_m)\, e^{im\theta}, \quad f_k(x_n, n) = (W_k x_n)\, e^{in\theta}, \quad g(x_m, x_n, m - n) = \mathrm{Re}\big[(W_q x_m)(W_k x_n)^* e^{i(m-n)\theta}\big], \tag{12}$$

where $\mathrm{Re}[\cdot]$ is the real part of a complex number, $(W_k x_n)^*$ represents the complex conjugate of $(W_k x_n)$, and $\theta \in \mathbb{R}$ is a preset non-zero constant. We can further write $f_{\{q,k\}}$ in matrix multiplication form:

$$f_{\{q,k\}}(x_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} W^{(11)}_{\{q,k\}} & W^{(12)}_{\{q,k\}} \\ W^{(21)}_{\{q,k\}} & W^{(22)}_{\{q,k\}} \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \end{pmatrix}, \tag{13}$$

where $(x_m^{(1)}, x_m^{(2)})$ is $x_m$ expressed in 2D coordinates. Similarly, the function $g$ can be turned into matrix form. Thus, we arrive at a solution to the formulation in Section 3.1 under the 2D case. Specifically, incorporating the relative position embedding is straightforward: simply rotate the affine-transformed word embedding vector by an angle equal to a multiple of its position index. Due to this characteristic, we name it Rotary Position Embedding.
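A tiny numeric check of the 2D case, assuming $\theta = 1$ purely for illustration: rotating the affine-transformed query and key by $m\theta$ and $n\theta$ as in eq. (13) leaves their inner product dependent only on $m - n$.

```python
import numpy as np

def rope_2d(x, pos, theta=1.0):
    """Eq. (13) for d = 2: rotate the (already affine-transformed) 2D vector
    by the angle pos * theta."""
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    R = np.array([[c, -s], [s, c]])
    return R @ x

rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)   # stand-ins for W_q x_m and W_k x_n
# the inner product only depends on the relative position m - n:
a = rope_2d(q, 3) @ rope_2d(k, 1)               # m = 3, n = 1
b = rope_2d(q, 10) @ rope_2d(k, 8)              # m = 10, n = 8 (same m - n = 2)
print(np.allclose(a, b))                        # True
```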

3.2.2 General form

In order to generalize our result in 2D to any $x_i \in \mathbb{R}^d$ where $d$ is even, we divide the $d$-dimensional space into $d/2$ sub-spaces and combine them using the linearity of the inner product, turning $f_{\{q,k\}}$ into:

$$f_{\{q,k\}}(x_m, m) = R^d_{\Theta,m} W_{\{q,k\}} x_m, \tag{14}$$

where $R^d_{\Theta,m}$ is the block-diagonal rotary matrix

$$R^d_{\Theta,m} = \begin{pmatrix}
\cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
\sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\
0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\
0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2}
\end{pmatrix} \tag{15}$$

with pre-defined rotation angles $\theta_i$ (their choice is discussed in Section 3.3). Applying our RoPE to the self-attention in eq. (2), we obtain:

$$q_m^\top k_n = (R^d_{\Theta,m} W_q x_m)^\top (R^d_{\Theta,n} W_k x_n) = x_m^\top W_q^\top R^d_{\Theta,n-m} W_k x_n, \tag{16}$$

where $R^d_{\Theta,n-m} = (R^d_{\Theta,m})^\top R^d_{\Theta,n}$. Notice that $R^d_\Theta$ is an orthogonal matrix, which ensures stability during the process of encoding position information. In addition, due to the sparsity of $R^d_\Theta$, applying the matrix multiplication directly as in eq. (16) is not computationally efficient; we provide another realization in Appendix B. In contrast to the additive nature of the position embedding methods used by other works, i.e. eqs. (3) to (10), our approach is multiplicative. Moreover, our RoPE naturally incorporates relative position information through the rotation matrix product instead of altering terms of the additive position encoding in self-attention.
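Below is a sketch of applying RoPE in $d$ dimensions using the element-wise realization mentioned above (detailed in Appendix B) instead of materializing the sparse matrix of eq. (15); the base 10000 and the indexing of $\theta_i$ follow the choice stated in Section 3.3, and the relative property of eq. (16) is checked numerically.

```python
import numpy as np

def apply_rope(x, pos, base=10000):
    """Apply R^d_{Theta,pos} to a vector x of even dimension d using the
    element-wise form (cf. Appendix B) rather than the sparse matrix of eq. (15)."""
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)          # theta_i
    angles = pos * theta
    cos = np.repeat(np.cos(angles), 2)                    # (cos, cos, cos, cos, ...)
    sin = np.repeat(np.sin(angles), 2)
    x_rot = np.empty_like(x)
    x_rot[0::2], x_rot[1::2] = -x[1::2], x[0::2]          # (-x2, x1, -x4, x3, ...)
    return x * cos + x_rot * sin

rng = np.random.default_rng(0)
d = 8
q, k = rng.normal(size=d), rng.normal(size=d)             # stand-ins for W_q x_m and W_k x_n
# eq. (16): the score depends only on the relative position m - n
print(np.allclose(apply_rope(q, 5) @ apply_rope(k, 2),
                  apply_rope(q, 7) @ apply_rope(k, 4)))   # True
```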


3.3 Properties of RoPE

Long-term decay: Following [21], we choose $\theta_i = 10000^{-2i/d}$. One can prove that this setting provides a long-term decay property (refer to Appendix C for more details), which means the inner product decays as the relative position increases. This property coincides with the intuition that a pair of tokens with a long relative distance should have less connection.

RoPE with linear attention: The self-attention can be rewritten in a more general form:

$$\mathrm{Attention}(Q, K, V)_m = \frac{\sum_{n=1}^N \mathrm{sim}(q_m, k_n)\, v_n}{\sum_{n=1}^N \mathrm{sim}(q_m, k_n)}. \tag{17}$$

The original self-attention chooses $\mathrm{sim}(q_m, k_n) = \exp(q_m^\top k_n/\sqrt{d})$. Notice that the original self-attention needs to compute the inner product of the query and key for every pair of tokens, which has quadratic complexity $O(N^2)$. Following [10], linear attentions reformulate eq. (17) as

$$\mathrm{Attention}(Q, K, V)_m = \frac{\sum_{n=1}^N \phi(q_m)^\top \varphi(k_n)\, v_n}{\sum_{n=1}^N \phi(q_m)^\top \varphi(k_n)}, \tag{18}$$

where $\phi(\cdot)$, $\varphi(\cdot)$ are usually non-negative functions. [10] has proposed $\phi(x) = \varphi(x) = \mathrm{elu}(x) + 1$ and first computes the multiplication between keys and values using the associative property of matrix multiplication. [19] has proposed to use the softmax function to normalize queries and keys separately before the inner product, which is equivalent to $\phi(q_i) = \mathrm{softmax}(q_i)$ and $\varphi(k_j) = \exp(k_j)$. For more details about linear attention, we encourage readers to refer to the original papers. In this section, we focus on incorporating RoPE with eq. (18).

[Figure 1: Illustration of Rotary Position Embedding (RoPE): the 2D pair (x1, x2) of each token is rotated to (x'1, x'2) by the angle mθ1, shown on the example sentence "Enhanced Transformer with Rotary Position Embedding".]

Since RoPE injects position information by rotation, which keeps the norm of hidden representations unchanged, we can combine RoPE with linear attention by multiplying the rotation matrix with the outputs of the non-negative functions:

$$\mathrm{Attention}(Q, K, V)_m = \frac{\sum_{n=1}^N \big(R^d_{\Theta,m}\phi(q_m)\big)^\top \big(R^d_{\Theta,n}\varphi(k_n)\big)\, v_n}{\sum_{n=1}^N \phi(q_m)^\top \varphi(k_n)}. \tag{19}$$

It is worth mentioning that we keep the denominator unchanged to avoid the risk of dividing by zero, and the summation in the numerator could contain negative terms. Although the weights for each value $v_i$ in eq. (19) are not strictly probabilistically normalized, we argue that such computation can still model the importance of values.
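A minimal sketch of eqs. (18)-(19) with the feature map $\phi(x) = \mathrm{elu}(x) + 1$ from [10]: the rotation is applied only to the feature maps in the numerator while the denominator stays unrotated. All shapes and the single-head setting are assumed for illustration.

```python
import numpy as np

def rope_rotate(X, base=10000):
    """Rotate row m of X by R^d_{Theta,m} using the element-wise form."""
    N, d = X.shape
    theta = base ** (-2 * np.arange(d // 2) / d)
    angles = np.arange(N)[:, None] * theta                 # (N, d/2)
    cos, sin = np.repeat(np.cos(angles), 2, -1), np.repeat(np.sin(angles), 2, -1)
    X_rot = np.empty_like(X)
    X_rot[:, 0::2], X_rot[:, 1::2] = -X[:, 1::2], X[:, 0::2]
    return X * cos + X_rot * sin

def elu_plus_one(x):
    return np.where(x > 0, x + 1.0, np.exp(x))             # phi(x) = elu(x) + 1, non-negative

def rope_linear_attention(Q, K, V):
    """Sketch of eq. (19): rotate the non-negative feature maps in the
    numerator, keep the denominator of eq. (18) unchanged."""
    phi_q, phi_k = elu_plus_one(Q), elu_plus_one(K)
    rq, rk = rope_rotate(phi_q), rope_rotate(phi_k)        # R_{Theta,m} phi(q_m), R_{Theta,n} phi(k_n)
    num = (rq @ rk.T) @ V                                  # sum_n (R phi(q_m))^T (R phi(k_n)) v_n
    den = (phi_q @ phi_k.T).sum(axis=-1, keepdims=True)    # sum_n phi(q_m)^T phi(k_n)
    return num / den

rng = np.random.default_rng(0)
N, d = 6, 8
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
print(rope_linear_attention(Q, K, V).shape)                # (6, 8)
```

For clarity this sketch materializes the full N × N matrix; an actual linear-attention implementation would instead accumulate the sums over the rotated keys and values using the associativity of matrix products.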

4 Experiments

In this section, we report preliminary experimental results of RoFormer on Chinese language modeling; a complete benchmark on English tasks is in progress and will be released once done. We discuss the implementation of RoFormer in Section 4.1 and present the pre-training results on Chinese data in Section 4.2. To illustrate the performance of RoFormer on long texts, we choose a task in which most documents exceed 512 characters (Section 4.3) for downstream evaluation and discuss the results in Section 4.4.

4.1 Implementation

Our RoFormer model is based on WoBERT [20], in which we replace the absolute position embedding with our proposed RoPE. For cross-comparison with other pre-trained transformer-based models in Chinese, i.e. BERT [4], WoBERT [20], and NEZHA [23], we tabulate their tokenization level and position embedding in Table 1.

4.2 Pre-training

We pre-train RoFormer on approximately 34GB of data consisting of Chinese Wikipedia, news, forums, etc. The pre-training was done in multiple stages with varying batch sizes and maximum input sequence lengths in order to adapt the model to various scenarios, as shown in Table 2. According to Table 2, the accuracy of RoFormer improves as the upper bound of the sequence length increases, which demonstrates the ability of RoFormer to deal with long texts. We attribute this to the excellent generalizability of the proposed RoPE.

4.3 Downstream Tasks & Dataset

We choose the Chinese AI and Law 2019 Similar Case Matching dataset (CAIL2019-SCM) [24] to illustrate the ability of RoFormer to deal with long texts in a semantic text matching task.

CAIL2019-SCM contains 8964 triplets of cases published by the Supreme People's Court of China. The input triplet, denoted as (A, B, C), consists of the fact descriptions of three cases. The task is to predict whether the pair (A, B) is closer than the pair (A, C) under a predefined similarity measure. Due to the legal background of the CAIL2019-SCM dataset, most of its documents contain more than 512 characters, which makes it challenging for existing methods to capture document-level information.

The amount of data used in our experiments is shown in Table 3.

4.4 Results

We apply the pre-trained RoFormer model discussed in Section 4.2 to the downstream task CAIL2019-SCM with different input lengths. The model is compared with the BERT and WoBERT models pre-trained on the same data, as shown in Table 4. With a short text cut-off, i.e. 512, RoFormer achieves results comparable to WoBERT and is slightly better than the BERT implementation. However, when the maximum input text length is increased to 1024, RoFormer outperforms WoBERT by an absolute improvement of 1.5%.

5 Conclusions

In this work, we proposed a new position embedding method that incorporates explicit relative position dependency in self-attention to enhance the performance of transformer architectures. Our theoretical analysis shows that relative position can be naturally formulated using vector products in self-attention, with the absolute position information encoded through a rotation matrix. In addition, we mathematically illustrated the advantageous properties of the proposed method when applied in the transformer. Finally, our preliminary experiments on Chinese data demonstrate that the enhanced transformer achieves superior performance on tasks with long texts. Detailed experiments on English benchmarks are in progress and will be released soon.

A Derivation of RoPE under 2D

Under the case of $d = 2$, we consider two word embedding vectors $x_q$, $x_k$ corresponding to the query and the key, and their positions $m$ and $n$, respectively. According to eq. (1), their position-encoded counterparts are:

$$q_m = f_q(x_q, m), \qquad k_n = f_k(x_k, n), \tag{20}$$

where the subscripts of $q_m$, $k_n$ indicate the encoded position information. Assume there exists a function $g$ that defines the inner product between vectors produced by $f_{\{q,k\}}$:

$$q_m^\top k_n = \langle f_q(x_m, m), f_k(x_n, n)\rangle = g(x_m, x_n, n - m). \tag{21}$$

We further ask the below initial condition to be satisfied:

$$q = f_q(x_q, 0), \qquad k = f_k(x_k, 0), \tag{22}$$

which denotes the vectors with empty position information encoded. With the above settings, we manage to find a solution for $f_q$, $f_k$. First, we take advantage of the geometric meaning of vectors in 2D and their complex counterparts, and decompose the functions in eqs. (20) and (21) into:

$$f_q(x_q, m) = R_q(x_q, m)\, e^{i\Theta_q(x_q, m)}, \quad f_k(x_k, n) = R_k(x_k, n)\, e^{i\Theta_k(x_k, n)}, \quad g(x_q, x_k, n - m) = R_g(x_q, x_k, n - m)\, e^{i\Theta_g(x_q, x_k, n - m)}, \tag{23}$$

where $R_f$, $R_g$ and $\Theta_f$, $\Theta_g$ are the radial and angular components of $f_{\{q,k\}}$ and $g$, respectively. Plugging them into eq. (21), we get the relation:

$$R_q(x_q, m)\, R_k(x_k, n) = R_g(x_q, x_k, n - m), \qquad \Theta_k(x_k, n) - \Theta_q(x_q, m) = \Theta_g(x_q, x_k, n - m), \tag{24}$$

with the corresponding initial conditions:

$$q = \|q\| e^{i\theta_q} = R_q(x_q, 0)\, e^{i\Theta_q(x_q, 0)}, \qquad k = \|k\| e^{i\theta_k} = R_k(x_k, 0)\, e^{i\Theta_k(x_k, 0)}, \tag{25}$$

where $\|q\|$, $\|k\|$ and $\theta_q$, $\theta_k$ are the radial and angular parts of $q$ and $k$ on the 2D plane.

Next, we set $m = n$ in eq. (24) and take into account the initial conditions in eq. (25):

$$R_q(x_q, m)\, R_k(x_k, m) = R_g(x_q, x_k, 0) = R_q(x_q, 0)\, R_k(x_k, 0) = \|q\|\|k\|, \tag{26a}$$
$$\Theta_k(x_k, m) - \Theta_q(x_q, m) = \Theta_g(x_q, x_k, 0) = \Theta_k(x_k, 0) - \Theta_q(x_q, 0) = \theta_k - \theta_q. \tag{26b}$$

On one hand, from eq. (26a) we have a straightforward solution for $R_f$:

$$R_q(x_q, m) = R_q(x_q, 0) = \|q\|, \qquad R_k(x_k, n) = R_k(x_k, 0) = \|k\|, \qquad R_g(x_q, x_k, n - m) = R_g(x_q, x_k, 0) = \|q\|\|k\|, \tag{27}$$

which means the radial functions $R_q$, $R_k$, and $R_g$ are independent of the position information. On the other hand, eq. (26b) shows that $\Theta_q(x_q, m) - \theta_q = \Theta_k(x_k, m) - \theta_k$, which indicates that the angular functions do not depend on the query and key; we thus set $\Theta_f := \Theta_q = \Theta_k$. Moreover, the term $\Theta_f(x_{\{q,k\}}, m) - \theta_{\{q,k\}}$ is a function of the position $m$ alone, independent of the word embedding $x_{\{q,k\}}$; we denote it as $\phi(m)$, yielding:

$$\Theta_f(x_{\{q,k\}}, m) = \phi(m) + \theta_{\{q,k\}}. \tag{28}$$

Further, by plugging $n = m + 1$ into eq. (24) and considering the above equation, we have:

$$\phi(m + 1) - \phi(m) = \Theta_g(x_q, x_k, 1) + \theta_q - \theta_k. \tag{29}$$

Since the RHS is a constant irrelevant to $m$, the function $\phi(m)$ with continuous integer inputs produces an arithmetic progression. Thus, it is straightforward to get:

$$\phi(m) = m\theta + \gamma, \tag{30}$$

where $\theta, \gamma \in \mathbb{R}$ are constants and $\theta$ is non-zero. To summarize our solutions from eqs. (27) to (30):

$$f_q(x_q, m) = \|q\| e^{i(\theta_q + m\theta + \gamma)} = q\, e^{i(m\theta + \gamma)}, \qquad f_k(x_k, n) = \|k\| e^{i(\theta_k + n\theta + \gamma)} = k\, e^{i(n\theta + \gamma)}. \tag{31}$$

Finally, notice that we have not set any constraints on the functions in eq. (22); thus $f_q(x_m, 0)$ and $f_k(x_n, 0)$ are left free to choose. To make our result comparable to eq. (3), we simply set:

$$q = f_q(x_m, 0) = W_q x_m, \qquad k = f_k(x_n, 0) = W_k x_n. \tag{32}$$

With the above and simply setting $\gamma = 0$ in eq. (31), the ultimate solution is:

$$f_q(x_m, m) = (W_q x_m)\, e^{im\theta}, \qquad f_k(x_n, n) = (W_k x_n)\, e^{in\theta}. \tag{33}$$

B Computationally efficient realization of rotary matrix multiplication

Taking advantage of the sparsity of $R^d_{\Theta,m}$ in eq. (15), a more computationally efficient realization of the multiplication between the matrix $R^d_{\Theta,m}$ and a vector $x \in \mathbb{R}^d$ is:

$$R^d_{\Theta,m}\, x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_{d-1} \\ x_d \end{pmatrix} \otimes \begin{pmatrix} \cos m\theta_1 \\ \cos m\theta_1 \\ \cos m\theta_2 \\ \cos m\theta_2 \\ \vdots \\ \cos m\theta_{d/2} \\ \cos m\theta_{d/2} \end{pmatrix} + \begin{pmatrix} -x_2 \\ x_1 \\ -x_4 \\ x_3 \\ \vdots \\ -x_d \\ x_{d-1} \end{pmatrix} \otimes \begin{pmatrix} \sin m\theta_1 \\ \sin m\theta_1 \\ \sin m\theta_2 \\ \sin m\theta_2 \\ \vdots \\ \sin m\theta_{d/2} \\ \sin m\theta_{d/2} \end{pmatrix}, \tag{34}$$

where $\otimes$ denotes the element-wise product.

C Long-term decay of RoPE

We can group the entries of the vectors $q = W_q x_m$ and $k = W_k x_n$ in pairs, and the inner product of RoPE in eq. (16) can be written as a complex number multiplication:

$$(R^d_{\Theta,m} W_q x_m)^\top (R^d_{\Theta,n} W_k x_n) = \mathrm{Re}\Big[\sum_{i=0}^{d/2-1} q_{[2i:2i+1]}\, k^*_{[2i:2i+1]}\, e^{i(m-n)\theta_i}\Big], \tag{35}$$

where $q_{[2i:2i+1]}$ represents the $2i$-th to $(2i{+}1)$-th entries of $q$. Denote $h_i = q_{[2i:2i+1]}\, k^*_{[2i:2i+1]}$ and $S_j = \sum_{i=0}^{j-1} e^{i(m-n)\theta_i}$, and let $h_{d/2} = 0$ and $S_0 = 0$; we can rewrite the summation using Abel transformation:

$$\sum_{i=0}^{d/2-1} h_i\, e^{i(m-n)\theta_i} = \sum_{i=0}^{d/2-1} h_i (S_{i+1} - S_i) = -\sum_{i=0}^{d/2-1} S_{i+1} (h_{i+1} - h_i). \tag{36}$$

Thus,

$$\Big|\sum_{i=0}^{d/2-1} h_i\, e^{i(m-n)\theta_i}\Big| = \Big|\sum_{i=0}^{d/2-1} S_{i+1} (h_{i+1} - h_i)\Big| \le \sum_{i=0}^{d/2-1} |S_{i+1}|\,|h_{i+1} - h_i| \le \Big(\max_i |h_{i+1} - h_i|\Big) \sum_{i=0}^{d/2-1} |S_{i+1}|. \tag{37}$$

So the value of $\frac{1}{d/2}\sum_{i=1}^{d/2} |S_i|$ decays as the relative distance $m - n$ increases under the setting $\theta_i = 10000^{-2i/d}$, as shown in fig. (2).

[Figure 2: Long-term decay of RoPE (relative upper bound vs. relative distance).]
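To make the bound of eq. (37) concrete, the short script below computes $\frac{1}{d/2}\sum_i |S_i|$ for a few relative distances; $d = 128$ is assumed for illustration, while the base 10000 follows the text. The values shrink as the distance grows, which is the behavior plotted in Figure 2.

```python
import numpy as np

def relative_upper_bound(rel_dist, d=128, base=10000):
    """Compute (1 / (d/2)) * sum_i |S_i| from eq. (37), where
    S_j = sum_{i<j} exp(1j * rel_dist * theta_i) and theta_i = base^{-2i/d}."""
    theta = base ** (-2 * np.arange(d // 2) / d)
    terms = np.exp(1j * rel_dist * theta)
    S = np.cumsum(terms)                       # S_1, ..., S_{d/2}
    return np.abs(S).mean()

for dist in (1, 10, 50, 100, 200):
    print(dist, round(relative_upper_bound(dist), 3))
# the bound shrinks as the relative distance grows, matching the trend in Figure 2
```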

References

[1] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 6572-6583, 2018.
[2] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR, 2020.
[3] Zihang Dai, Z. Yang, Yiming Yang, J. Carbonell, Quoc V. Le, and R. Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL, 2019.
[4] J. Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
[5] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In International Conference on Machine Learning, pages 1243-1252. PMLR, 2017.
[6] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. ArXiv, abs/2006.03654, 2020.
[7] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, I. Simon, C. Hawthorne, Andrew M. Dai, M. Hoffman, M. Dinculescu, and D. Eck. Music transformer. arXiv: Learning, 2018.
[8] Zhiheng Huang, Davis Liang, Peng Xu, and Bing Xiang. Improve transformer models with better relative position embeddings. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3327-3335, Online, November 2020. Association for Computational Linguistics.
[9] Md. Amirul Islam, Sen Jia, and Neil D. B. Bruce. How much position information do convolutional neural networks encode? ArXiv, abs/2001.08248, 2020.
[10] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156-5165. PMLR, 2020.
[11] Guolin Ke, Di He, and T. Liu. Rethinking positional encoding in language pre-training. ArXiv, abs/2006.15595, 2020.
[12] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations, 2020.
[13] Xuanqing Liu, Hsiang-Fu Yu, Inderjit S. Dhillon, and Cho-Jui Hsieh. Learning to encode position for transformer with continuous dynamical model. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 6327-6335. PMLR, 2020.
[14] Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. In EMNLP, 2016.
[15] A. Radford and Karthik Narasimhan. Improving language understanding by generative pre-training. 2018.
[16] A. Radford, Jeffrey Wu, R. Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
[17] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, W. Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1-140:67, 2020.
[18] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In NAACL-HLT, 2018.
[19] Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3531-3539, 2021.
[20] Jianlin Su. WoBERT: Word-based Chinese BERT model - ZhuiyiAI. Technical report, 2020.
[21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[22] Benyou Wang, Donghao Zhao, Christina Lioma, Qiuchi Li, Peng Zhang, and Jakob Grue Simonsen. Encoding word order in complex embeddings. In International Conference on Learning Representations, 2020.
[23] Victor Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen, and Qun Liu. NEZHA: Neural contextualized representation for Chinese language understanding. 08 2019.
[24] Chaojun Xiao, Haoxi Zhong, Zhipeng Guo, Cunchao Tu, Zhiyuan Liu, Maosong Sun, Tianyang Zhang, Xianpei Han, Zhen Hu, Heng Wang, and Jianfeng Xu. CAIL2019-SCM: A dataset of similar case matching in legal domain. 11 2019.
[25] Z. Yang, Zihang Dai, Yiming Yang, J. Carbonell, R. Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS, 2019.
[26] Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020.