Class CodePageUtil

java.lang.Object
org.apache.poi.util.CodePageUtil

public class CodePageUtil extends Object
Utilities for working with Microsoft CodePages.

Provides constants for the numeric codepage identifiers, along with utilities to translate them into Java character sets.

  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final int
    CP_037
    Codepage 037, a special case
    static final int
    CP_EUC_JP
    Codepage for EUC-JP
    static final int
    CP_EUC_KR
    Codepage for EUC-KR
    static final int
    CP_GB18030
    Codepage for GB18030
    static final int
    CP_GB2312
    Codepage for GB2312
    static final int
    CP_GBK
    Codepage for GBK, aka MS936
    static final int
    CP_ISO_2022_JP1
    Codepage for ISO-2022-JP
    static final int
    CP_ISO_2022_JP2
    Another codepage for ISO-2022-JP
    static final int
    CP_ISO_2022_JP3
    Yet another codepage for ISO-2022-JP
    static final int
    CP_ISO_2022_KR
    Codepage for ISO-2022-KR
    static final int
    CP_ISO_8859_1
    Codepage for ISO-8859-1
    static final int
    CP_ISO_8859_2
    Codepage for ISO-8859-2
    static final int
    CP_ISO_8859_3
    Codepage for ISO-8859-3
    static final int
    CP_ISO_8859_4
    Codepage for ISO-8859-4
    static final int
    CP_ISO_8859_5
    Codepage for ISO-8859-5
    static final int
    CP_ISO_8859_6
    Codepage for ISO-8859-6
    static final int
    CP_ISO_8859_7
    Codepage for ISO-8859-7
    static final int
    CP_ISO_8859_8
    Codepage for ISO-8859-8
    static final int
    CP_ISO_8859_9
    Codepage for ISO-8859-9
    static final int
    CP_JOHAB
    Codepage for Johab
    static final int
    CP_KOI8_R
    Codepage for KOI8-R
    static final int
    CP_MAC_ARABIC
    Codepage for Macintosh Arabic (Java: MacArabic)
    static final int
    CP_MAC_CENTRAL_EUROPE
    Codepage for Macintosh Central Europe (Latin-2) (Java: MacCentralEurope)
    static final int
    CP_MAC_CHINESE_SIMPLE
    Codepage for Macintosh Chinese Simplified (Java: unknown - use EUC_CN, ISO2022_CN_GB, MS936 or cp935)
    static final int
    CP_MAC_CHINESE_TRADITIONAL
    Codepage for Macintosh Chinese Traditional (Java: unknown - use Big5, MS950, or cp937)
    static final int
    CP_MAC_CROATIAN
    Codepage for Macintosh Croatian (Java: MacCroatian)
    static final int
    CP_MAC_CYRILLIC
    Codepage for Macintosh Cyrillic (Java: MacCyrillic)
    static final int
    CP_MAC_GREEK
    Codepage for Macintosh Greek (Java: MacGreek)
    static final int
    CP_MAC_HEBREW
    Codepage for Macintosh Hebrew (Java: MacHebrew)
    static final int
    CP_MAC_ICELAND
    Codepage for Macintosh Iceland (Java: MacIceland)
    static final int
    CP_MAC_JAPAN
    Codepage for Macintosh Japan (Java: unknown - use SJIS, cp942 or cp943)
    static final int
    CP_MAC_KOREAN
    Codepage for Macintosh Korean (Java: unknown - use EUC_KR or cp949)
    static final int
    CP_MAC_ROMAN
    Codepage for Macintosh Roman (Java: MacRoman)
    static final int
    CP_MAC_ROMAN_BIFF23
    static final int
    CP_MAC_ROMANIA
    Codepage for Macintosh Romanian (Java: MacRomania)
    static final int
    CP_MAC_THAI
    Codepage for Macintosh Thai (Java: MacThai)
    static final int
    CP_MAC_TURKISH
    Codepage for Macintosh Turkish (Java: MacTurkish)
    static final int
    CP_MAC_UKRAINE
    Codepage for Macintosh Ukrainian (Java: MacUkraine)
    static final int
    CP_MS949
    Codepage for MS949
    static final int
    CP_SJIS
    Codepage for SJIS
    static final int
    CP_UNICODE
    Codepage for Unicode
    static final int
    CP_US_ACSII
    Codepage for US-ASCII
    static final int
    CP_US_ASCII2
    Another codepage for US-ASCII
    static final int
    CP_UTF16
    Codepage for UTF-16
    static final int
    CP_UTF16_BE
    Codepage for UTF-16 big-endian
    static final int
    CP_UTF8
    Codepage for UTF-8
    static final int
    CP_WINDOWS_1250
    Codepage for Windows 1250
    static final int
    CP_WINDOWS_1251
    Codepage for Windows 1251
    static final int
    CP_WINDOWS_1252
    Codepage for Windows 1252
    static final int
    CP_WINDOWS_1252_BIFF23
    static final int
    CP_WINDOWS_1253
    Codepage for Windows 1253
    static final int
    CP_WINDOWS_1254
    Codepage for Windows 1254
    static final int
    CP_WINDOWS_1255
    Codepage for Windows 1255
    static final int
    CP_WINDOWS_1256
    Codepage for Windows 1256
    static final int
    CP_WINDOWS_1257
    Codepage for Windows 1257
    static final int
    CP_WINDOWS_1258
    Codepage for Windows 1258
    static final Set<Charset>
    DOUBLE_BYTE_CHARSETS
  • Constructor Summary

    Constructors
    Constructor
    Description
    CodePageUtil()
  • Method Summary

    Modifier and Type
    Method
    Description
    static String
    codepageToEncoding(int codepage)
    Turns a codepage number into the equivalent character encoding's name (in Java NIO canonical naming format).
    static String
    codepageToEncoding(int codepage, boolean javaLangFormat)
    Turns a codepage number into the equivalent character encoding's name, in either Java NIO or Java Lang canonical naming.
    static String
    cp950ToString(byte[] data, int offset, int lengthInBytes)
    Tries to convert a little-endian (LE) byte array in cp950 (Microsoft's dialect of Big5) to a String.
    static byte[]
    getBytesInCodePage(String string, int codepage)
    Converts a string into bytes, in the equivalent character encoding to the supplied codepage number.
    static String
    getStringFromCodePage(byte[] string, int codepage)
    Converts the bytes into a String, based on the equivalent character encoding to the supplied codepage number.
    static String
    getStringFromCodePage(byte[] string, int offset, int length, int codepage)
    Converts the bytes into a String, based on the equivalent character encoding to the supplied codepage number.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • DOUBLE_BYTE_CHARSETS

      public static final Set<Charset> DOUBLE_BYTE_CHARSETS
    • CP_037

      public static final int CP_037

      Codepage 037, a special case

    • CP_SJIS

      public static final int CP_SJIS

      Codepage for SJIS

    • CP_GBK

      public static final int CP_GBK

      Codepage for GBK, aka MS936

    • CP_MS949

      public static final int CP_MS949

      Codepage for MS949

    • CP_UTF16

      public static final int CP_UTF16

      Codepage for UTF-16

    • CP_UTF16_BE

      public static final int CP_UTF16_BE

      Codepage for UTF-16 big-endian

    • CP_WINDOWS_1250

      public static final int CP_WINDOWS_1250

      Codepage for Windows 1250

    • CP_WINDOWS_1251

      public static final int CP_WINDOWS_1251

      Codepage for Windows 1251

    • CP_WINDOWS_1252

      public static final int CP_WINDOWS_1252

      Codepage for Windows 1252

    • CP_WINDOWS_1252_BIFF23

      public static final int CP_WINDOWS_1252_BIFF23
    • CP_WINDOWS_1253

      public static final int CP_WINDOWS_1253

      Codepage for Windows 1253

    • CP_WINDOWS_1254

      public static final int CP_WINDOWS_1254

      Codepage for Windows 1254

    • CP_WINDOWS_1255

      public static final int CP_WINDOWS_1255

      Codepage for Windows 1255

    • CP_WINDOWS_1256

      public static final int CP_WINDOWS_1256

      Codepage for Windows 1256

    • CP_WINDOWS_1257

      public static final int CP_WINDOWS_1257

      Codepage for Windows 1257

    • CP_WINDOWS_1258

      public static final int CP_WINDOWS_1258

      Codepage for Windows 1258

    • CP_JOHAB

      public static final int CP_JOHAB

      Codepage for Johab

    • CP_MAC_ROMAN

      public static final int CP_MAC_ROMAN

      Codepage for Macintosh Roman (Java: MacRoman)

    • CP_MAC_ROMAN_BIFF23

      public static final int CP_MAC_ROMAN_BIFF23
    • CP_MAC_JAPAN

      public static final int CP_MAC_JAPAN

      Codepage for Macintosh Japan (Java: unknown - use SJIS, cp942 or cp943)

    • CP_MAC_CHINESE_TRADITIONAL

      public static final int CP_MAC_CHINESE_TRADITIONAL

      Codepage for Macintosh Chinese Traditional (Java: unknown - use Big5, MS950, or cp937)

    • CP_MAC_KOREAN

      public static final int CP_MAC_KOREAN

      Codepage for Macintosh Korean (Java: unknown - use EUC_KR or cp949)

    • CP_MAC_ARABIC

      public static final int CP_MAC_ARABIC

      Codepage for Macintosh Arabic (Java: MacArabic)

    • CP_MAC_HEBREW

      public static final int CP_MAC_HEBREW

      Codepage for Macintosh Hebrew (Java: MacHebrew)

    • CP_MAC_GREEK

      public static final int CP_MAC_GREEK

      Codepage for Macintosh Greek (Java: MacGreek)

    • CP_MAC_CYRILLIC

      public static final int CP_MAC_CYRILLIC

      Codepage for Macintosh Cyrillic (Java: MacCyrillic)

    • CP_MAC_CHINESE_SIMPLE

      public static final int CP_MAC_CHINESE_SIMPLE

      Codepage for Macintosh Chinese Simplified (Java: unknown - use EUC_CN, ISO2022_CN_GB, MS936 or cp935)

    • CP_MAC_ROMANIA

      public static final int CP_MAC_ROMANIA

      Codepage for Macintosh Romanian (Java: MacRomania)

    • CP_MAC_UKRAINE

      public static final int CP_MAC_UKRAINE

      Codepage for Macintosh Ukrainian (Java: MacUkraine)

    • CP_MAC_THAI

      public static final int CP_MAC_THAI

      Codepage for Macintosh Thai (Java: MacThai)

    • CP_MAC_CENTRAL_EUROPE

      public static final int CP_MAC_CENTRAL_EUROPE

      Codepage for Macintosh Central Europe (Latin-2) (Java: MacCentralEurope)

    • CP_MAC_ICELAND

      public static final int CP_MAC_ICELAND

      Codepage for Macintosh Iceland (Java: MacIceland)

    • CP_MAC_TURKISH

      public static final int CP_MAC_TURKISH

      Codepage for Macintosh Turkish (Java: MacTurkish)

    • CP_MAC_CROATIAN

      public static final int CP_MAC_CROATIAN

      Codepage for Macintosh Croatian (Java: MacCroatian)

    • CP_US_ACSII

      public static final int CP_US_ACSII

      Codepage for US-ASCII

    • CP_KOI8_R

      public static final int CP_KOI8_R

      Codepage for KOI8-R

    • CP_ISO_8859_1

      public static final int CP_ISO_8859_1

      Codepage for ISO-8859-1

    • CP_ISO_8859_2

      public static final int CP_ISO_8859_2

      Codepage for ISO-8859-2

    • CP_ISO_8859_3

      public static final int CP_ISO_8859_3

      Codepage for ISO-8859-3

    • CP_ISO_8859_4

      public static final int CP_ISO_8859_4

      Codepage for ISO-8859-4

    • CP_ISO_8859_5

      public static final int CP_ISO_8859_5

      Codepage for ISO-8859-5

    • CP_ISO_8859_6

      public static final int CP_ISO_8859_6

      Codepage for ISO-8859-6

    • CP_ISO_8859_7

      public static final int CP_ISO_8859_7

      Codepage for ISO-8859-7

    • CP_ISO_8859_8

      public static final int CP_ISO_8859_8

      Codepage for ISO-8859-8

    • CP_ISO_8859_9

      public static final int CP_ISO_8859_9

      Codepage for ISO-8859-9

    • CP_ISO_2022_JP1

      public static final int CP_ISO_2022_JP1

      Codepage for ISO-2022-JP

    • CP_ISO_2022_JP2

      public static final int CP_ISO_2022_JP2

      Another codepage for ISO-2022-JP

    • CP_ISO_2022_JP3

      public static final int CP_ISO_2022_JP3

      Yet another codepage for ISO-2022-JP

    • CP_ISO_2022_KR

      public static final int CP_ISO_2022_KR

      Codepage for ISO-2022-KR

    • CP_EUC_JP

      public static final int CP_EUC_JP

      Codepage for EUC-JP

    • CP_EUC_KR

      public static final int CP_EUC_KR

      Codepage for EUC-KR

    • CP_GB2312

      public static final int CP_GB2312

      Codepage for GB2312

    • CP_GB18030

      public static final int CP_GB18030

      Codepage for GB18030

    • CP_US_ASCII2

      public static final int CP_US_ASCII2

      Another codepage for US-ASCII

    • CP_UTF8

      public static final int CP_UTF8

      Codepage for UTF-8

    • CP_UNICODE

      public static final int CP_UNICODE

      Codepage for Unicode

  • Constructor Details

    • CodePageUtil

      public CodePageUtil()
  • Method Details

    • getBytesInCodePage

      public static byte[] getBytesInCodePage(String string, int codepage) throws UnsupportedEncodingException
      Converts a string into bytes, in the equivalent character encoding to the supplied codepage number.
      Parameters:
      string - The string to convert
      codepage - The codepage number
      Throws:
      UnsupportedEncodingException
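
To illustrate what "the equivalent character encoding" means in practice, here is a minimal sketch that encodes a string with the JDK charset matching codepage 1252. It uses only the JDK, not POI. The class `EncodeSketch` and its helper `encode` are hypothetical names, and pairing codepage 1252 with the JDK charset `windows-1252` is an assumption based on the naming described for `codepageToEncoding`.

```java
import java.nio.charset.Charset;

final class EncodeSketch {
    // Hypothetical helper: encodes text with the named JDK charset.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is the single byte 0xE9 in windows-1252,
        // so the four-character string becomes four bytes.
        byte[] bytes = encode("caf\u00e9", "windows-1252");
        System.out.println(bytes.length); // 4
    }
}
```

The same string would need five bytes in UTF-8, which is why the codepage stored with the data has to be used when writing it back.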
    • getStringFromCodePage

      public static String getStringFromCodePage(byte[] string, int codepage) throws UnsupportedEncodingException
      Converts the bytes into a String, based on the equivalent character encoding to the supplied codepage number.
      Parameters:
      string - The bytes of the string to convert
      codepage - The codepage number
      Throws:
      UnsupportedEncodingException
    • getStringFromCodePage

      public static String getStringFromCodePage(byte[] string, int offset, int length, int codepage) throws UnsupportedEncodingException
      Converts the bytes into a String, based on the equivalent character encoding to the supplied codepage number.
      Parameters:
      string - The bytes of the string to convert
      offset - The index of the first byte to convert
      length - The number of bytes to convert
      codepage - The codepage number
      Throws:
      UnsupportedEncodingException
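
The offset and length variant is useful when the encoded text sits inside a larger buffer. The following is a JDK-only sketch of that idea, not a call into POI. The class `DecodeSketch` and its helper `decode` are hypothetical, and the two leading bytes in the sample buffer merely stand in for whatever precedes the text in a real record.

```java
import java.nio.charset.Charset;

final class DecodeSketch {
    // Hypothetical helper: decodes `length` bytes starting at `offset`.
    static String decode(byte[] data, int offset, int length, String charsetName) {
        return new String(data, offset, length, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Two placeholder bytes, then two Cyrillic letters in windows-1251
        // (0xC4 is U+0414, 0xE0 is U+0430).
        byte[] buffer = {0x02, 0x00, (byte) 0xC4, (byte) 0xE0};
        String text = decode(buffer, 2, 2, "windows-1251");
        System.out.println(text.length()); // 2
    }
}
```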
    • codepageToEncoding

      public static String codepageToEncoding(int codepage) throws UnsupportedEncodingException

      Turns a codepage number into the equivalent character encoding's name (in Java NIO canonical naming format).

      Parameters:
      codepage - The codepage number
      Returns:
      The character encoding's name. If the codepage number is 65001, the encoding name is "UTF-8". All other positive numbers are mapped to their Java NIO names, normally either "windows-" followed by the number, e.g. "windows-1251", or "cp" followed by the number, e.g. if the codepage number is 1252, the returned character encoding name will be "cp1252".
      Throws:
      UnsupportedEncodingException - if the specified codepage is less than zero.
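
The naming rule in the Returns text can be sketched as a small lookup. This is not POI's implementation: the class `CodepageNameSketch` is hypothetical, it knows only two codepages, and it throws an unchecked `IllegalArgumentException` where the documented method throws `UnsupportedEncodingException`.

```java
final class CodepageNameSketch {
    // Hypothetical lookup covering only a few codepages; anything else
    // falls back to "cp" + number, as the Returns text describes.
    static String toEncodingName(int codepage) {
        if (codepage < 0) {
            // The documented method throws UnsupportedEncodingException here;
            // an unchecked exception keeps this sketch short.
            throw new IllegalArgumentException("Negative codepage: " + codepage);
        }
        switch (codepage) {
            case 65001: return "UTF-8";
            case 1251:  return "windows-1251";
            default:    return "cp" + codepage;
        }
    }

    public static void main(String[] args) {
        System.out.println(toEncodingName(65001)); // UTF-8
        System.out.println(toEncodingName(1251));  // windows-1251
        System.out.println(toEncodingName(437));   // cp437
    }
}
```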
    • codepageToEncoding

      public static String codepageToEncoding(int codepage, boolean javaLangFormat) throws UnsupportedEncodingException

      Turns a codepage number into the equivalent character encoding's name, in either Java NIO or Java Lang canonical naming.

      Parameters:
      codepage - The codepage number
      javaLangFormat - true to use Java Lang naming, false to use Java NIO naming
      Returns:
      The character encoding's name, in either Java Lang format (e.g. Cp1251, ISO8859_5) or Java NIO format (e.g. windows-1252, ISO-8859-9).
      Throws:
      UnsupportedEncodingException - if the specified codepage is less than zero.
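
The two naming styles refer to the same JDK charsets. As a small JDK-only illustration (the class `NamingStyleSketch` and its helper `nioName` are hypothetical), the Java Lang style names quoted above can be resolved through `Charset.forName`, which accepts them as aliases and reports the canonical Java NIO name.

```java
import java.nio.charset.Charset;

final class NamingStyleSketch {
    // Resolves any supported name or alias to the canonical java.nio name.
    static String nioName(String anyName) {
        return Charset.forName(anyName).name();
    }

    public static void main(String[] args) {
        System.out.println(nioName("Cp1251"));    // windows-1251
        System.out.println(nioName("ISO8859_5")); // ISO-8859-5
    }
}
```

Because the JDK resolves both styles, the choice mostly matters when the name is stored, logged, or compared as a string.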
    • cp950ToString

      public static String cp950ToString(byte[] data, int offset, int lengthInBytes)
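      Tries to convert a little-endian (LE) byte array in cp950 (Microsoft's dialect of Big5) to a String. Microsoft zero-pads ASCII characters, and those padding bytes are dropped. This conversion may have room for improvement.
      Parameters:
      data - The byte array to decode
      offset - The index of the first byte to decode
      lengthInBytes - The number of bytes to decode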
      Returns:
      Decoded String
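
One possible reading of the description is sketched below, using only the JDK. It assumes two bytes per character with the low byte first, treats a zero high byte as ASCII padding, and hands every other pair to the JDK's `Big5` decoder as a stand-in for cp950. The class `Cp950Sketch` is hypothetical, and none of this is confirmed to match what POI does internally.

```java
import java.nio.charset.Charset;

final class Cp950Sketch {
    // Assumed layout: two bytes per character, low byte first;
    // ASCII characters carry a zero high byte as padding.
    static String decode(byte[] data, int offset, int lengthInBytes) {
        StringBuilder sb = new StringBuilder();
        int end = Math.min(data.length, offset + lengthInBytes);
        for (int i = offset; i + 1 < end; i += 2) {
            int low = data[i] & 0xff;
            int high = data[i + 1] & 0xff;
            if (high == 0) {
                sb.append((char) low); // drop the zero padding, keep the ASCII byte
            } else {
                // Swap to lead-byte-first order before using the JDK decoder.
                // "Big5" is a stand-in for cp950 and is looked up lazily.
                byte[] pair = {(byte) high, (byte) low};
                sb.append(new String(pair, Charset.forName("Big5")));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] padded = {0x41, 0x00, 0x42, 0x00};
        System.out.println(decode(padded, 0, padded.length)); // AB
    }
}
```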