/* SPDX-License-Identifier: GPL-2.0-only */
/*
 * Copyright 2012 Xyratex Technology Limited
 *
 * Using hardware provided PCLMULQDQ instruction to accelerate the CRC32
 * calculation.
 * CRC32 polynomial: 0x04c11db7(BE)/0xEDB88320(LE)
 * PCLMULQDQ is a new instruction in Intel SSE4.2, the reference can be found
 * at:
 * http://www.intel.com/products/processor/manuals/
 *  Intel(R) 64 and IA-32 Architectures Software Developer's Manual
 *  Volume 2B: Instruction Set Reference, N-Z
 *
 * Authors:   Gregory Prestas <Gregory_Prestas@us.xyratex.com>
 *            Alexander Boyko <Alexander_Boyko@xyratex.com>
 */

#include <linux/linkage.h>


.section .rodata
.align 16
/*
 * [(x4*128+32 mod P(x) << 32)]' << 1   = 0x154442bd4
 * #define CONSTANT_R1  0x154442bd4LL
 *
 * [(x4*128-32 mod P(x) << 32)]' << 1   = 0x1c6e41596
 * #define CONSTANT_R2  0x1c6e41596LL
 */
.Lconstant_R2R1:
	.octa 0x00000001c6e415960000000154442bd4
/*
 * [(x128+32 mod P(x) << 32)]'   << 1   = 0x1751997d0
 * #define CONSTANT_R3  0x1751997d0LL
 *
 * [(x128-32 mod P(x) << 32)]'   << 1   = 0x0ccaa009e
 * #define CONSTANT_R4  0x0ccaa009eLL
 */
.Lconstant_R4R3:
	.octa 0x00000000ccaa009e00000001751997d0
/*
 * [(x64 mod P(x) << 32)]'       << 1   = 0x163cd6124
 * #define CONSTANT_R5  0x163cd6124LL
 */
.Lconstant_R5:
	.octa 0x00000000000000000000000163cd6124
.Lconstant_mask32:
	.octa 0x000000000000000000000000FFFFFFFF
/*
 * #define CRCPOLY_TRUE_LE_FULL 0x1DB710641LL
 *
 * Barrett Reduction constant (u64`) = u` = (x**64 / P(x))` = 0x1F7011641LL
 * #define CONSTANT_RU  0x1F7011641LL
 */
.Lconstant_RUpoly:
	.octa 0x00000001F701164100000001DB710641

#define CONSTANT %xmm0

#ifdef __x86_64__
#define BUF     %rdi
#define LEN     %rsi
#define CRC     %edx
#else
#define BUF     %eax
#define LEN     %edx
#define CRC     %ecx
#endif



.text
/**
 *      Calculate crc32
 *      BUF - buffer (16 bytes aligned)
 *      LEN - sizeof buffer (16 bytes aligned), LEN should be greater than 63
 *      CRC - initial crc32
 *      return %eax crc32
 *      uint crc32_pclmul_le_16(unsigned char const *buffer,
 *                              size_t len, uint crc32)
 */
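/*
 * Outline of crc32_pclmul_le_16():
 *  1. the first 64 bytes are loaded into %xmm1-%xmm4 and the initial CRC
 *     is XORed into the low 32 bits of %xmm1
 *  2. loop_64 folds this 512-bit state over each following 64-byte block:
 *     each 128-bit word is carry-less multiplied by the R2:R1 constant pair
 *     (pclmulqdq $0x00 pairs the low 64-bit halves, $0x11 the high halves)
 *     and XORed with the corresponding word of the next block
 *  3. less_64 folds the four accumulators into one 128-bit value using
 *     R4:R3, and loop_16 folds in any remaining 16-byte blocks
 *  4. fold_64 and the final 32-bit fold with R5 reduce the accumulator to
 *     64 bits, and the bit-reversed Barrett reduction with RU and P(x)
 *     yields the 32-bit CRC returned in %eax
 */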
SYM_FUNC_START(crc32_pclmul_le_16) /* buffer and buffer size are 16 bytes aligned */
	movdqa  (BUF), %xmm1
	movdqa  0x10(BUF), %xmm2
	movdqa  0x20(BUF), %xmm3
	movdqa  0x30(BUF), %xmm4
	movd    CRC, CONSTANT
	pxor    CONSTANT, %xmm1
	sub     $0x40, LEN
	add     $0x40, BUF
	cmp     $0x40, LEN
	jb      less_64

#ifdef __x86_64__
	movdqa .Lconstant_R2R1(%rip), CONSTANT
#else
	movdqa .Lconstant_R2R1, CONSTANT
#endif

loop_64:/*  64 bytes Full cache line folding */
	prefetchnta    0x40(BUF)
	movdqa  %xmm1, %xmm5
	movdqa  %xmm2, %xmm6
	movdqa  %xmm3, %xmm7
#ifdef __x86_64__
	movdqa  %xmm4, %xmm8
#endif
	pclmulqdq $0x00, CONSTANT, %xmm1
	pclmulqdq $0x00, CONSTANT, %xmm2
	pclmulqdq $0x00, CONSTANT, %xmm3
#ifdef __x86_64__
	pclmulqdq $0x00, CONSTANT, %xmm4
#endif
	pclmulqdq $0x11, CONSTANT, %xmm5
	pclmulqdq $0x11, CONSTANT, %xmm6
	pclmulqdq $0x11, CONSTANT, %xmm7
#ifdef __x86_64__
	pclmulqdq $0x11, CONSTANT, %xmm8
#endif
	pxor    %xmm5, %xmm1
	pxor    %xmm6, %xmm2
	pxor    %xmm7, %xmm3
#ifdef __x86_64__
	pxor    %xmm8, %xmm4
#else
	/* xmm8 unsupported for x32 */
	movdqa  %xmm4, %xmm5
	pclmulqdq $0x00, CONSTANT, %xmm4
	pclmulqdq $0x11, CONSTANT, %xmm5
	pxor    %xmm5, %xmm4
#endif

	pxor    (BUF), %xmm1
	pxor    0x10(BUF), %xmm2
	pxor    0x20(BUF), %xmm3
	pxor    0x30(BUF), %xmm4

	sub     $0x40, LEN
	add     $0x40, BUF
	cmp     $0x40, LEN
	jge     loop_64
less_64:/*  Folding cache line into 128bit */
#ifdef __x86_64__
	movdqa  .Lconstant_R4R3(%rip), CONSTANT
#else
	movdqa  .Lconstant_R4R3, CONSTANT
#endif
	prefetchnta     (BUF)

	movdqa  %xmm1, %xmm5
	pclmulqdq $0x00, CONSTANT, %xmm1
	pclmulqdq $0x11, CONSTANT, %xmm5
	pxor    %xmm5, %xmm1
	pxor    %xmm2, %xmm1

	movdqa  %xmm1, %xmm5
	pclmulqdq $0x00, CONSTANT, %xmm1
	pclmulqdq $0x11, CONSTANT, %xmm5
	pxor    %xmm5, %xmm1
	pxor    %xmm3, %xmm1

	movdqa  %xmm1, %xmm5
	pclmulqdq $0x00, CONSTANT, %xmm1
	pclmulqdq $0x11, CONSTANT, %xmm5
	pxor    %xmm5, %xmm1
	pxor    %xmm4, %xmm1

	cmp     $0x10, LEN
	jb      fold_64
loop_16:/* Folding rest buffer into 128bit */
	movdqa  %xmm1, %xmm5
	pclmulqdq $0x00, CONSTANT, %xmm1
	pclmulqdq $0x11, CONSTANT, %xmm5
	pxor    %xmm5, %xmm1
	pxor    (BUF), %xmm1
	sub     $0x10, LEN
	add     $0x10, BUF
	cmp     $0x10, LEN
	jge     loop_16

fold_64:
	/* perform the last 64 bit fold, also adds 32 zeroes
	 * to the input stream */
	pclmulqdq $0x01, %xmm1, CONSTANT /* R4 * xmm1.low */
	psrldq  $0x08, %xmm1
	pxor    CONSTANT, %xmm1

	/* final 32-bit fold */
	movdqa  %xmm1, %xmm2
#ifdef __x86_64__
	movdqa  .Lconstant_R5(%rip), CONSTANT
	movdqa  .Lconstant_mask32(%rip), %xmm3
#else
	movdqa  .Lconstant_R5, CONSTANT
	movdqa  .Lconstant_mask32, %xmm3
#endif
	psrldq  $0x04, %xmm2
	pand    %xmm3, %xmm1
	pclmulqdq $0x00, CONSTANT, %xmm1
	pxor    %xmm2, %xmm1

	/* Finish up with the bit-reversed barrett reduction 64 ==> 32 bits */
#ifdef __x86_64__
	movdqa  .Lconstant_RUpoly(%rip), CONSTANT
#else
	movdqa  .Lconstant_RUpoly, CONSTANT
#endif
	movdqa  %xmm1, %xmm2
	pand    %xmm3, %xmm1
	pclmulqdq $0x10, CONSTANT, %xmm1
	pand    %xmm3, %xmm1
	pclmulqdq $0x00, CONSTANT, %xmm1
	pxor    %xmm2, %xmm1
	pextrd  $0x01, %xmm1, %eax

	RET
SYM_FUNC_END(crc32_pclmul_le_16)
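
/*
 * Illustrative C caller, a minimal sketch only: crc32_pclmul() and
 * crc32_le_fallback() are hypothetical names, not this file's actual glue
 * code.  It shows how the preconditions above (16-byte aligned buffer,
 * length a multiple of 16 and greater than 63) might be satisfied by
 * handing the unaligned head and the short tail to a generic bytewise
 * routine and only the aligned middle to crc32_pclmul_le_16().
 *
 *	unsigned int crc32_pclmul(unsigned int crc, unsigned char const *p,
 *				  size_t len)
 *	{
 *		size_t head = (-(unsigned long)p) & 15;	// bytes to 16B alignment
 *		size_t body;
 *
 *		if (len < head + 64)			// too short for the PCLMUL path
 *			return crc32_le_fallback(crc, p, len);	// hypothetical fallback
 *
 *		if (head) {				// align the pointer first
 *			crc = crc32_le_fallback(crc, p, head);
 *			p += head;
 *			len -= head;
 *		}
 *
 *		body = len & ~(size_t)15;		// >= 64 and a multiple of 16
 *		crc = crc32_pclmul_le_16(p, body, crc);
 *		p += body;
 *		len -= body;
 *
 *		return len ? crc32_le_fallback(crc, p, len) : crc;
 *	}
 */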