Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SR-15712] UBP<UInt8>/URBP.elementsEqual is slow #57990

Open
karwa opened this issue Jan 10, 2022 · 2 comments
Open

[SR-15712] UBP<UInt8>/URBP.elementsEqual is slow #57990

karwa opened this issue Jan 10, 2022 · 2 comments
Labels
bug A deviation from expected or documented behavior. Also: expected but undesirable behavior. compiler The Swift compiler in itself performance standard library Area: Standard library umbrella

Comments

@karwa
Copy link
Contributor

karwa commented Jan 10, 2022

Previous ID SR-15712
Radar rdar://97613773
Original Reporter @karwa
Type Bug
Environment

Xcode 13.2.1, macOS 11.6

Also tested on Swift version 5.6-dev (LLVM 031406b6cea8006, Swift 9f7d3fc)
Target: x86_64-unknown-linux-gnu

Additional Detail from JIRA
Votes 0
Component/s Compiler
Labels Bug
Assignee None
Priority Medium

md5: ac54ed7b0a71df80a5ca916293426167

Issue Description:

Consider the following example, demonstrating a IPv4 address type:

func test(_ lhs: IPv4Address, _ rhs: IPv4Address) -> Bool {
    return lhs == rhs
}

public struct IPv4Address: Equatable {
  public typealias Octets = (UInt8, UInt8, UInt8, UInt8)
  public var octets: Octets

  public init(octets: Octets) { 
      self.octets = octets
  }

  @inlinable
  public static func == (lhs: Self, rhs: Self) -> Bool {
      return withUnsafeBytes(of: lhs.octets) { lhsBytes in
          return withUnsafeBytes(of: rhs.octets) { rhsBytes in
              return lhsBytes.elementsEqual(rhsBytes)
          }
      }
  }
}

This generates quite a lot of code, including calls to runtime functions. This is clearly suboptimal, and is so large it won't even be inlined.

output.test(output.IPv4Address, output.IPv4Address) -> Swift.Bool:
        push    rbp
        push    rbx
        sub     rsp, 24
        mov     ebp, esi
        mov     dword ptr [rsp + 16], edi
        lea     rbx, [rsp + 16]
        lea     rdi, [rip + (demangling cache variable for type metadata for (Swift.UInt8, Swift.UInt8, Swift.UInt8, Swift.UInt8))]
        call    __swift_instantiateConcreteTypeFromMangledName
        lea     r8, [rsp + 20]
        mov     dword ptr [rsp + 8], ebp
        lea     rsi, [rsp + 8]
        lea     r9, [rsp + 12]
.LBB1_1:
        cmp     r8, rbx
        je      .LBB1_2
        movzx   edi, byte ptr [rbx]
        lea     rbp, [rbx + 1]
        jmp     .LBB1_3
.LBB1_2:
        xor     edi, edi
        mov     rbp, rbx
.LBB1_3:
        cmp     r9, rsi
        sete    al
        je      .LBB1_4
        movzx   ecx, byte ptr [rsi]
        lea     rdx, [rsi + 1]
        cmp     r8, rbx
        jne     .LBB1_6
        jmp     .LBB1_8
.LBB1_4:
        xor     ecx, ecx
        mov     rdx, rsi
        cmp     r8, rbx
        je      .LBB1_8
.LBB1_6:
        xor     eax, eax
        cmp     r9, rsi
        je      .LBB1_8
        mov     rbx, rbp
        mov     rsi, rdx
        cmp     dil, cl
        je      .LBB1_1
.LBB1_8:
        add     rsp, 24
        pop     rbx
        pop     rbp
        ret

If we replace the calls to elementsEqual with the C library function memcmp, things are much better:

return withUnsafeBytes(of: lhs.octets) { lhsBytes in
  return withUnsafeBytes(of: rhs.octets) { rhsBytes in
    memcmp(lhsBytes.baseAddress!, rhsBytes.baseAddress!, 4) == 0
    // return lhsBytes.elementsEqual(rhsBytes)
  }
}
output.test(output.IPv4Address, output.IPv4Address) -> Swift.Bool:
        cmp     edi, esi
        sete    al
        ret

Wow. From all of that to just a simple 'cmp' instruction. What a difference! Makes sense - this is just a 32-bit integer, which we're choosing to store as a 4-element tuple of bytes. This is an enormous win both for code-size and performance.

In general, I think a lot more comparisons between contiguous buffers of elements with no padding should optimise to memcmp. It's a lot faster than the generic specialisation for elementsEqual, and the compiler already knows what memcmp "means". Instead, I'm having to add functions like this to my library:

import Glibc

extension Sequence where Element == UInt8 {
  @inline(__always)  // wCSIA
  internal func fastElementsEqual<Other>(_ other: Other) -> Bool where Other: Sequence, Other.Element == UInt8 {
    let contiguousResult = withContiguousStorageIfAvailable { selfBuffer in
      other.withContiguousStorageIfAvailable { otherBuffer in
        selfBuffer.memcmpElementsEqual(otherBuffer)
      }
    }
    switch contiguousResult {
    case .some(.some(let result)):
      return result
    default:
      return elementsEqual(other)
    }
  }
}

extension UnsafeBufferPointer where Element == UInt8 {
  internal func memcmpElementsEqual(_ other: UnsafeBufferPointer<UInt8>) -> Bool {
    count == other.count ? memcmp(baseAddress, other.baseAddress, count) == 0 : false
  }
}

Godbolt link.

@swift-ci swift-ci transferred this issue from apple/swift-issues Apr 25, 2022
@CodaFi CodaFi added the standard library Area: Standard library umbrella label Jul 26, 2022
@CodaFi
Copy link
Member

CodaFi commented Jul 26, 2022

Mirroring this as rdar://97613773

@karwa
Copy link
Contributor Author

karwa commented Jul 26, 2022

Interesting!

The Godbolt link includes the repro code, and it also has this implementation at the end, #if-ed out:

        // This is a little better, but still nowhere near as good as memcmp.
        lhs.octets.0 == rhs.octets.0 &&
        lhs.octets.1 == rhs.octets.1 &&
        lhs.octets.2 == rhs.octets.2 &&
        lhs.octets.3 == rhs.octets.3

It turns out that in Swift versions 5.3 to 5.6 (inclusive), this does a bunch of shifting and individual byte comparisons at -O:

output.test(output.IPv4Address, output.IPv4Address) -> Swift.Bool:
        xor     eax, eax
        cmp     dil, sil
        jne     .LBB1_4
        mov     ecx, edi
        shr     ecx, 8
        mov     edx, esi
        shr     edx, 8
        cmp     cl, dl
        jne     .LBB1_4
        mov     ecx, edi
        shr     ecx, 16
        mov     edx, esi
        shr     edx, 16
        cmp     cl, dl
        jne     .LBB1_4
        shr     esi, 24
        shr     edi, 24
        cmp     dil, sil
        sete    al
.LBB1_4:
        ret

BUT - the nightly and 5.7 nightly builds do, in fact, fuse this implementation in to a single comparison. Just like we got with memcmp. That's really cool!

output.test(output.IPv4Address, output.IPv4Address) -> Swift.Bool:
        cmp     edi, esi
        sete    al
        ret

It would still be good for at least the unsafe buffer types to delegate to memcmp where possible, I think - for code-size if nothing else; we shouldn't be specialising elementsEqual for them, and possibly not for other contiguous collections, either, if we can help it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A deviation from expected or documented behavior. Also: expected but undesirable behavior. compiler The Swift compiler in itself performance standard library Area: Standard library umbrella
Projects
None yet
Development

No branches or pull requests

3 participants