Chapter 6 - Character Set Translation

This chapter describes how to tell the assembler which code pages to use when translating between the EBCDIC, ASCII and Unicode character sets.

Code Points and Code Pages

There are numerous ways to define which code points (numeric values) are assigned to which characters. Each of these schemes can be called a code page. IBM has assigned numbers to many of the different code pages.

Because the Tachyon assemblers can read and write files containing characters encoded in EBCDIC, ASCII and/or Unicode, the assembler needs to know how to translate between these character sets. The CODEPAGE option tells the assembler which EBCDIC code page and which ASCII code page to assume for encoded characters. Each assembly uses exactly one EBCDIC code page and one ASCII code page.

The assembler supports over 90 different Single Byte Character Set (SBCS) code pages. All of the EBCDIC and ASCII code pages are defined in terms of their translation to and from Unicode. When translating between EBCDIC and ASCII, characters are effectively converted first to Unicode and then to the target code page.
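The two-step translation through Unicode can be sketched in Python, which is used here purely for illustration and is not the assembler's implementation. The sketch assumes that Python's cp037 and latin_1 codecs correspond to code pages 37 and 819; the function name ebcdic_to_ascii is hypothetical.

```python
def ebcdic_to_ascii(data: bytes, ebcdic: str = "cp037", ascii_cp: str = "latin_1") -> bytes:
    """Translate EBCDIC bytes to an ASCII code page by way of Unicode."""
    text = data.decode(ebcdic)      # step 1: EBCDIC code points -> Unicode
    return text.encode(ascii_cp)    # step 2: Unicode -> target code page


# The letters H, E, L, L, O at their usual EBCDIC code points.
hello = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])
print(ebcdic_to_ascii(hello))
```

Because both steps go through Unicode, a character that exists in the source code page but not in the target code page makes the second step fail, which is the untranslatable case.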

Two given code pages normally do not define the same set of 256 characters, so some characters usually cannot be translated between them. The assembler requires that every character in the IBM High Level Assembler’s Standard Character Set be translatable between the selected pair of EBCDIC and ASCII code pages. All of these characters except the three national characters must translate to their usual code points. The characters with fixed code points are the uppercase letters A-Z, the lowercase letters a-z and the digits 0-9, as well as the following:

	blank	&	'	(	)	*	+	,	-	.	/	:	=	_
ASCII	20	26	27	28	29	2A	2B	2C	2D	2E	2F	3A	3D	5F
EBCDIC	40	50	7D	4D	5D	5C	4E	6B	60	4B	61	7A	7E	6D

The national characters at EBCDIC code points X'5B', X'7B' and X'7C' must also be translatable to ASCII. In code page 37 (EBCDIC USA), these code points correspond to the $, # and @ characters.

Note: Some EBCDIC code pages, such as 290 (EBCDIC Katakana), 803 (EBCDIC Hebrew) and 1030 (EBCDIC Katakana Extended), do not assign the lowercase letters a-z to their normal code points. These code pages cannot be used by the assembler.
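The requirements above can be expressed as a small check. The following Python sketch is only an illustration of the rules, not the assembler's own validation: the name is_usable is hypothetical, and Python codec names such as cp037 and latin_1 are assumed to stand in for the corresponding code pages.

```python
import string

# Usual EBCDIC code points for the letters and digits.
UPPER = list(range(0xC1, 0xCA)) + list(range(0xD1, 0xDA)) + list(range(0xE2, 0xEA))
LOWER = list(range(0x81, 0x8A)) + list(range(0x91, 0x9A)) + list(range(0xA2, 0xAA))
DIGITS = list(range(0xF0, 0xFA))

# Special characters from the table above, keyed by EBCDIC code point.
SPECIALS = {
    0x40: " ", 0x50: "&", 0x7D: "'", 0x4D: "(", 0x5D: ")", 0x5C: "*", 0x4E: "+",
    0x6B: ",", 0x60: "-", 0x4B: ".", 0x61: "/", 0x7A: ":", 0x7E: "=", 0x6D: "_",
}

# National characters: must be translatable, but to no particular code point.
NATIONAL = (0x5B, 0x7B, 0x7C)


def required_map() -> dict:
    """Map each fixed EBCDIC code point to the character it must represent."""
    expected = dict(SPECIALS)
    expected.update(zip(UPPER, string.ascii_uppercase))
    expected.update(zip(LOWER, string.ascii_lowercase))
    expected.update(zip(DIGITS, string.digits))
    return expected


def is_usable(ebcdic: str, ascii_cp: str) -> bool:
    """Return True if the code page pair satisfies the rules described above."""
    try:
        for point, char in required_map().items():
            decoded = bytes([point]).decode(ebcdic)
            # The character must be the expected one and land on its usual ASCII code point.
            if decoded != char or decoded.encode(ascii_cp) != char.encode("ascii"):
                return False
        for point in NATIONAL:
            # National characters only need to survive the translation.
            bytes([point]).decode(ebcdic).encode(ascii_cp)
    except UnicodeError:
        return False
    return True


print(is_usable("cp037", "latin_1"))
```

A code page that moves the lowercase letters away from their usual code points fails the first loop, which corresponds to the case described in the note above.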

CODEPAGE Option

IBM’s High Level Assembler uses the CODEPAGE option to define the EBCDIC code page of the source files. It uses this code page information only to translate EBCDIC characters to Unicode in CU constants and literals. The Tachyon assemblers support an extended CODEPAGE option to specify both an EBCDIC and an ASCII code page.

The CODEPAGE option is specified as CODEPAGE(ebcdic,ascii,list) where ebcdic is an EBCDIC code page number, ascii is an ASCII code page number, and list is either LIST or NOLIST. The code page numbers may be specified as either decimal numbers or their hexadecimal equivalents using the X'hex' notation. When setting the CODEPAGE option, the EBCDIC code page must be specified. If the ASCII code page number is omitted, the default is 819 (ISO-8859-1 Latin-1). If the list option is omitted, the default is NOLIST. If LIST is specified, the resulting translation between the EBCDIC, ASCII and Unicode code points will be displayed in the assembly listing.
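As an illustration of the syntax and defaults just described, the following Python sketch parses a CODEPAGE option string. It is not part of the assembler. The function names are hypothetical, and two behaviors are assumptions that the text above does not confirm: the keyword is matched without regard to case, and an empty second operand, as in CODEPAGE(1047,,LIST), is treated as omitted.

```python
import re


def parse_number(field: str) -> int:
    """Accept a decimal code page number or its X'hex' equivalent."""
    match = re.fullmatch(r"[Xx]'([0-9A-Fa-f]+)'", field)
    if match:
        return int(match.group(1), 16)
    return int(field, 10)


def parse_codepage(option: str):
    """Return (ebcdic, ascii, list_flag) for a CODEPAGE(ebcdic,ascii,list) string."""
    match = re.fullmatch(r"CODEPAGE\((.*)\)", option.strip(), flags=re.IGNORECASE)
    if not match:
        raise ValueError("expected CODEPAGE(ebcdic,ascii,list)")
    fields = [field.strip() for field in match.group(1).split(",")]
    if len(fields) > 3 or not fields[0]:
        raise ValueError("the EBCDIC code page must be specified")
    fields += [""] * (3 - len(fields))          # pad the omitted operands

    ebcdic = parse_number(fields[0])
    ascii_cp = parse_number(fields[1]) if fields[1] else 819   # default ASCII code page
    listing = fields[2].upper()
    if listing not in ("", "LIST", "NOLIST"):
        raise ValueError("the third operand must be LIST or NOLIST")
    return ebcdic, ascii_cp, listing == "LIST"  # an omitted operand means NOLIST


print(parse_codepage("CODEPAGE(1047)"))
print(parse_codepage("CODEPAGE(X'417',819,LIST)"))
```

Both calls name the same code page pair, since X'417' is the hexadecimal equivalent of 1047.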

The default for the CODEPAGE option is CODEPAGE(1047,819,NOLIST). These code pages translate all 256 code points between EBCDIC and ASCII. They are also the default EBCDIC and ASCII code pages for z/OS UNIX System Services, Tachyon File Tools and the Tachyon Operating System. However, the default EBCDIC code page differs from High Level Assembler’s default of CODEPAGE(1148).
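Whether a pair of code pages translates all 256 code points can be checked mechanically. The Python sketch below is an illustration only. It uses the cp037 and cp1140 codecs paired with latin_1 as stand-ins, on the assumption that they match code pages 37, 1140 and 819; it does not use 1047, because a codec for that code page is not assumed to be available in a stock Python installation. The function name untranslatable is hypothetical.

```python
def untranslatable(ebcdic: str, ascii_cp: str) -> list:
    """Return the EBCDIC code points that cannot be translated to the ASCII code page."""
    missing = []
    for point in range(256):
        try:
            # Translate one code point through Unicode, as the assembler effectively does.
            bytes([point]).decode(ebcdic).encode(ascii_cp)
        except UnicodeError:
            missing.append(point)
    return missing


for name in ("cp037", "cp1140"):
    points = untranslatable(name, "latin_1")
    print(name, [format(point, "02X") for point in points])
```

An empty list means every code point can be translated. The check covers translatability only; it does not verify that the mapping is one-to-one.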

The following shows the output of the CODEPAGE(1047,819,LIST) option:

CodePage(1047,819)  EBCDIC/ASCII/Unicode Translation-
00/00/0000     01/01/0001     02/02/0002     03/03/0003     04/9C/009C     05/09/0009     06/86/0086     07/7F/007F
08/97/0097     09/8D/008D     0A/8E/008E     0B/0B/000B     0C/0C/000C     0D/0D/000D     0E/0E/000E     0F/0F/000F
10/10/0010     11/11/0011     12/12/0012     13/13/0013     14/9D/009D     15/85/0085     16/08/0008     17/87/0087
18/18/0018     19/19/0019     1A/92/0092     1B/8F/008F     1C/1C/001C     1D/1D/001D     1E/1E/001E     1F/1F/001F
20/80/0080     21/81/0081     22/82/0082     23/83/0083     24/84/0084     25/0A/000A     26/17/0017     27/1B/001B
28/88/0088     29/89/0089     2A/8A/008A     2B/8B/008B     2C/8C/008C     2D/05/0005     2E/06/0006     2F/07/0007
30/90/0090     31/91/0091     32/16/0016     33/93/0093     34/94/0094     35/95/0095     36/96/0096     37/04/0004
38/98/0098     39/99/0099     3A/9A/009A     3B/9B/009B     3C/14/0014     3D/15/0015     3E/9E/009E     3F/1A/001A
40/20/0020     41/A0/00A0     42/E2/00E2     43/E4/00E4     44/E0/00E0     45/E1/00E1     46/E3/00E3     47/E5/00E5
48/E7/00E7     49/F1/00F1     4A/A2/00A2     4B/2E/002E     4C/3C/003C     4D/28/0028     4E/2B/002B     4F/7C/007C
50/26/0026     51/E9/00E9     52/EA/00EA     53/EB/00EB     54/E8/00E8     55/ED/00ED     56/EE/00EE     57/EF/00EF
58/EC/00EC     59/DF/00DF     5A/21/0021     5B/24/0024     5C/2A/002A     5D/29/0029     5E/3B/003B     5F/5E/005E
60/2D/002D     61/2F/002F     62/C2/00C2     63/C4/00C4     64/C0/00C0     65/C1/00C1     66/C3/00C3     67/C5/00C5
68/C7/00C7     69/D1/00D1     6A/A6/00A6     6B/2C/002C     6C/25/0025     6D/5F/005F     6E/3E/003E     6F/3F/003F
70/F8/00F8     71/C9/00C9     72/CA/00CA     73/CB/00CB     74/C8/00C8     75/CD/00CD     76/CE/00CE     77/CF/00CF
78/CC/00CC     79/60/0060     7A/3A/003A     7B/23/0023     7C/40/0040     7D/27/0027     7E/3D/003D     7F/22/0022
80/D8/00D8     81/61/0061     82/62/0062     83/63/0063     84/64/0064     85/65/0065     86/66/0066     87/67/0067
88/68/0068     89/69/0069     8A/AB/00AB     8B/BB/00BB     8C/F0/00F0     8D/FD/00FD     8E/FE/00FE     8F/B1/00B1
90/B0/00B0     91/6A/006A     92/6B/006B     93/6C/006C     94/6D/006D     95/6E/006E     96/6F/006F     97/70/0070
98/71/0071     99/72/0072     9A/AA/00AA     9B/BA/00BA     9C/E6/00E6     9D/B8/00B8     9E/C6/00C6     9F/A4/00A4
A0/B5/00B5     A1/7E/007E     A2/73/0073     A3/74/0074     A4/75/0075     A5/76/0076     A6/77/0077     A7/78/0078
A8/79/0079     A9/7A/007A     AA/A1/00A1     AB/BF/00BF     AC/D0/00D0     AD/5B/005B     AE/DE/00DE     AF/AE/00AE
B0/AC/00AC     B1/A3/00A3     B2/A5/00A5     B3/B7/00B7     B4/A9/00A9     B5/A7/00A7     B6/B6/00B6     B7/BC/00BC
B8/BD/00BD     B9/BE/00BE     BA/DD/00DD     BB/A8/00A8     BC/AF/00AF     BD/5D/005D     BE/B4/00B4     BF/D7/00D7
C0/7B/007B     C1/41/0041     C2/42/0042     C3/43/0043     C4/44/0044     C5/45/0045     C6/46/0046     C7/47/0047
C8/48/0048     C9/49/0049     CA/AD/00AD     CB/F4/00F4     CC/F6/00F6     CD/F2/00F2     CE/F3/00F3     CF/F5/00F5
D0/7D/007D     D1/4A/004A     D2/4B/004B     D3/4C/004C     D4/4D/004D     D5/4E/004E     D6/4F/004F     D7/50/0050
D8/51/0051     D9/52/0052     DA/B9/00B9     DB/FB/00FB     DC/FC/00FC     DD/F9/00F9     DE/FA/00FA     DF/FF/00FF
E0/5C/005C     E1/F7/00F7     E2/53/0053     E3/54/0054     E4/55/0055     E5/56/0056     E6/57/0057     E7/58/0058
E8/59/0059     E9/5A/005A     EA/B2/00B2     EB/D4/00D4     EC/D6/00D6     ED/D2/00D2     EE/D3/00D3     EF/D5/00D5
F0/30/0030     F1/31/0031     F2/32/0032     F3/33/0033     F4/34/0034     F5/35/0035     F6/36/0036     F7/37/0037
F8/38/0038     F9/39/0039     FA/B3/00B3     FB/DB/00DB     FC/DC/00DC     FD/D9/00D9     FE/DA/00DA     FF/9F/009F

Each translatable EBCDIC character is displayed as a group of three code points. The first code point is for the selected EBCDIC code page, the second is for the selected ASCII code page and the third is the Unicode code point. If the EBCDIC character cannot be translated to ASCII, the ASCII code point will be listed as --.
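A listing in this format is easy to read back into a table. The Python sketch below is an illustration, not a Tachyon tool. The names ENTRY and parse_listing_line are hypothetical, and the sketch assumes that an entry whose ASCII code point is listed as -- still carries a four-digit Unicode code point.

```python
import re

# One entry: EBCDIC / ASCII (or -- when untranslatable) / Unicode, all in hexadecimal.
ENTRY = re.compile(r"\b([0-9A-F]{2})/([0-9A-F]{2}|--)/([0-9A-F]{4})\b")


def parse_listing_line(line: str) -> dict:
    """Map each EBCDIC code point on the line to (ASCII code point or None, Unicode code point)."""
    table = {}
    for ebcdic, ascii_cp, unicode_cp in ENTRY.findall(line):
        ascii_value = None if ascii_cp == "--" else int(ascii_cp, 16)
        table[int(ebcdic, 16)] = (ascii_value, int(unicode_cp, 16))
    return table


# A line taken from the CODEPAGE(1047,819,LIST) output shown above.
line = ("58/EC/00EC     59/DF/00DF     5A/21/0021     5B/24/0024     "
        "5C/2A/002A     5D/29/0029     5E/3B/003B     5F/5E/005E")
for ebcdic, (ascii_value, unicode_cp) in parse_listing_line(line).items():
    print(format(ebcdic, "02X"), ascii_value, format(unicode_cp, "04X"))
```

The heading line of the listing contains no entries in this format, so passing it to parse_listing_line yields an empty table.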

EBCDIC Code Pages

Code Page	Description
00037	EBCDIC USA, Canada, Australia, New Zealand, Netherlands, Brazil, Portugal
00264	EBCDIC Print Train and Text Processing
00273	EBCDIC Austria, Germany
00274	EBCDIC Belgium
00275	EBCDIC Brazil
00277	EBCDIC Denmark, Norway
00278	EBCDIC Finland, Sweden
00280	EBCDIC Italy
00281	EBCDIC Japanese English
00284	EBCDIC Spanish
00285	EBCDIC United Kingdom
00293	EBCDIC APL
00297	EBCDIC France
00420	EBCDIC Arabic
00423	EBCDIC Greek
00424	EBCDIC Hebrew
00500	EBCDIC Latin-1
00838	EBCDIC Thai
00870	EBCDIC Latin-2
00871	EBCDIC Iceland
00875	EBCDIC Greek
00880	EBCDIC Cyrillic
00924	EBCDIC Latin-9
01005	EBCDIC Isomorphic Text Communication
01025	EBCDIC Russian
01026	EBCDIC Turkey
01027	EBCDIC Japanese (Latin) Extended
01031	EBCDIC Japanese (Latin) Extended
01047	EBCDIC Latin-1
01122	EBCDIC Estonia
01123	EBCDIC Ukraine
01130	EBCDIC Vietnamese
01140	EBCDIC USA, Canada, Australia, New Zealand, Netherlands
01141	EBCDIC Austria, Germany
01142	EBCDIC Denmark, Norway
01143	EBCDIC Finland, Sweden
01144	EBCDIC Italy
01145	EBCDIC Spanish
01146	EBCDIC United Kingdom
01147	EBCDIC France
01148	EBCDIC Latin-1
01149	EBCDIC Iceland
01153	EBCDIC Latin-2
01154	EBCDIC Cyrillic
01155	EBCDIC Turkey
01156	EBCDIC Baltic
01157	EBCDIC Estonia
01158	EBCDIC Ukraine
01160	EBCDIC Thai
01164	EBCDIC Vietnamese
01165	EBCDIC Latin-2

ASCII Code Pages

Code Page	Description
00367	US-ASCII-7
00437	DOS USA
00720	DOS Arabic
00737	DOS Greek
00775	DOS Baltic
00813	ISO-8859-7 Greek
00819	ISO-8859-1 Latin-1 Western European
00850	DOS Latin-1
00852	DOS Latin-2
00855	DOS Cyrillic
00856	DOS Hebrew
00857	DOS Turkish
00858	DOS Latin-1 + Euro
00860	DOS Portuguese
00861	DOS Icelandic
00862	DOS Israel
00863	DOS French Canadian
00864	DOS Arabic
00865	DOS Nordic
00866	DOS Russian
00869	DOS Greek
00874	ISO-8859-11 Thai
00878	KOI8-R Russian
00907	ASCII APL
00910	DOS APL
00912	ISO-8859-2 Latin-2 Eastern European
00913	ISO-8859-3 Latin-3 Southern European
00914	ISO-8859-4 Latin-4 Northern European
00915	ISO-8859-5 Cyrillic
00916	ISO-8859-8 Hebrew
00919	ISO-8859-10 Latin-6 Nordic
00920	ISO-8859-9 Latin-5 Turkish
00921	ISO-8859-13 Latin-7 Baltic
00923	ISO-8859-15 Latin-9
01006	DOS Urdu
01089	ISO-8859-6 Arabic
01139	ASCII Japanese Alphanumeric Katakana
01250	Windows Latin-2
01251	Windows Cyrillic
01252	Windows Latin-1
01253	Windows Greek
01254	Windows Latin-5 Turkish
01255	Windows Hebrew
01256	Windows Arabic
01257	Windows Baltic
01258	Windows Vietnamese