Class GptBytePairEncodingParams

java.lang.Object
com.knuddels.jtokkit.api.GptBytePairEncodingParams

public final class GptBytePairEncodingParams extends Object
Parameters for the byte pair encoding used to tokenize text for the OpenAI GPT models.

This library supports the encodings listed in EncodingType out of the box. To use a custom encoding, pass its parameters to the library with this class. Call EncodingRegistry.registerGptBytePairEncoding(GptBytePairEncodingParams) to register the custom encoding with the registry, so that it can be used alongside the predefined ones.

The encoding parameters are:

  • name: The name of the encoding. This is used to identify the encoding and must be unique.
  • pattern: The pattern that is used to split the input text into tokens.
  • encoder: The encoder that maps the tokens to their ids.
  • specialTokensEncoder: The encoder that maps the special tokens to their ids.
  • Constructor Details

    • GptBytePairEncodingParams

      public GptBytePairEncodingParams(String name, Pattern pattern, Map<byte[],Integer> encoder, Map<String,Integer> specialTokensEncoder)
      Creates a new instance of GptBytePairEncodingParams.
      Parameters:
      name - the name of the encoding. This is used to identify the encoding and must be unique
      pattern - the pattern that is used to split the input text into tokens
      encoder - the encoder that maps the tokens to their ids
      specialTokensEncoder - the encoder that maps the special tokens to their ids
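The constructor arguments can be assembled from plain JDK types. The following is a minimal sketch and is not taken from the library: the class name CustomEncodingArguments, the encoding name "my-toy-encoding", the simple split pattern, and the four-entry vocabulary are all hypothetical and serve only to show the expected types. The vocabulary is not a usable byte pair vocabulary. The constructor call appears only as a comment so that the sketch compiles without the library on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CustomEncodingArguments {

    // Hypothetical split pattern: a run of letters, a run of digits,
    // or any single character that is neither.
    static final Pattern PATTERN = Pattern.compile("\\p{L}+|\\p{N}+|[^\\p{L}\\p{N}]");

    // Returns the pieces of the text matched by the pattern, in order.
    static List<String> split(Pattern pattern, String text) {
        List<String> pieces = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            pieces.add(matcher.group());
        }
        return pieces;
    }

    // Builds a toy vocabulary: the UTF-8 bytes of each token mapped to consecutive ids.
    static Map<byte[], Integer> buildEncoder(List<String> tokens) {
        Map<byte[], Integer> encoder = new HashMap<>();
        int id = 0;
        for (String token : tokens) {
            encoder.put(token.getBytes(StandardCharsets.UTF_8), id++);
        }
        return encoder;
    }

    // Builds the special token map, keyed by the literal token text.
    static Map<String, Integer> buildSpecialTokens(int firstFreeId) {
        Map<String, Integer> special = new HashMap<>();
        special.put("<|endoftext|>", firstFreeId);
        return special;
    }

    public static void main(String[] args) {
        Map<byte[], Integer> encoder = buildEncoder(List.of("a", "b", "ab", " "));
        Map<String, Integer> special = buildSpecialTokens(encoder.size());

        System.out.println(split(PATTERN, "ab 12!")); // prints [ab,  , 12, !]
        System.out.println(special);                  // prints {<|endoftext|>=4}

        // With the library on the classpath, the arguments would be passed as:
        // GptBytePairEncodingParams params =
        //         new GptBytePairEncodingParams("my-toy-encoding", PATTERN, encoder, special);
    }
}
```

The resulting instance would then be handed to EncodingRegistry.registerGptBytePairEncoding(GptBytePairEncodingParams), as described in the class summary.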
  • Method Details

    • getName

      public String getName()
    • getPattern

      public Pattern getPattern()
    • getEncoder

      public Map<byte[],Integer> getEncoder()
    • getSpecialTokensEncoder

      public Map<String,Integer> getSpecialTokensEncoder()
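Note that getEncoder() returns a map keyed by byte[]. Java arrays inherit identity-based equals and hashCode from Object, so whether a lookup with a freshly created array succeeds depends on the map implementation, which this page does not specify. The sketch below uses only JDK classes. The class EncoderLookup and its method byContent are hypothetical helpers, and the plain HashMap stands in for the returned map as an assumption. It copies such a map into one keyed by ByteBuffer, which compares by content.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class EncoderLookup {

    // Copies a byte[]-keyed map into one keyed by ByteBuffer,
    // whose equals and hashCode are based on the buffer content.
    static Map<ByteBuffer, Integer> byContent(Map<byte[], Integer> encoder) {
        Map<ByteBuffer, Integer> result = new HashMap<>();
        for (Map.Entry<byte[], Integer> entry : encoder.entrySet()) {
            // Clone the key so later changes to the original array cannot alter it.
            result.put(ByteBuffer.wrap(entry.getKey().clone()), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the map returned by getEncoder(); using a HashMap is an assumption.
        Map<byte[], Integer> encoder = new HashMap<>();
        encoder.put("ab".getBytes(StandardCharsets.UTF_8), 2);

        byte[] probe = "ab".getBytes(StandardCharsets.UTF_8);

        // Same bytes, different array instance: the HashMap lookup misses.
        System.out.println(encoder.get(probe)); // prints null

        // The content-keyed copy finds the entry.
        System.out.println(byContent(encoder).get(ByteBuffer.wrap(probe))); // prints 2
    }
}
```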