I was recently thrown by an encoding problem in a map file: the fix I tried kept failing, and the cause turned out to be a header at the start of the file. It took me half a day to work out that the file was encoded in UTF-8. So I have finally got these things straight.
Simply put, Unicode, GBK, and Big5 are code values, that is, the numbers assigned to characters, while UTF-8, UTF-16, and so on are forms for expressing such a value. The first three encodings are mutually incompatible: the same Chinese character has a completely different value in each of them. The Unicode value of "汉", for example, differs from its GBK value; suppose, hypothetically, that the Unicode value is A040 and the GBK value is B030. UTF-8 is only a way of writing a value down, and it is defined solely for Unicode values. To convert GBK to UTF-8, you must therefore first convert to the Unicode code and then convert that to UTF-8.
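The two-step conversion can be sketched in Python, whose standard codecs include `gbk` and `utf-8`. This is only an illustration: it uses the real codes of "汉" (GBK BABA, Unicode 6C49) rather than the hypothetical A040 and B030, and the helper name `gbk_to_utf8` is made up for this example.

```python
def gbk_to_utf8(gbk_bytes: bytes) -> bytes:
    """Convert GBK bytes to UTF-8 by going through Unicode."""
    text = gbk_bytes.decode("gbk")   # step 1: GBK bytes -> Unicode code points
    return text.encode("utf-8")      # step 2: Unicode code points -> UTF-8 bytes

gbk_han = bytes.fromhex("BABA")              # GBK code of "汉"
unicode_value = ord(gbk_han.decode("gbk"))   # the intermediate Unicode value
utf8_han = gbk_to_utf8(gbk_han)

print(f"{unicode_value:04X}")        # Unicode value in hexadecimal
print(utf8_han.hex(" ").upper())     # UTF-8 bytes in hexadecimal
```

There is no direct GBK-to-UTF-8 table here; Unicode is the pivot in the middle.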
For details, see the reposted article below.
A discussion of Unicode encoding, with brief explanations of terms such as UCS, UTF, BMP, and BOM
This is a piece of light reading written by a programmer for programmers. By "light reading" I mean that it lets you painlessly clear up a few concepts that used to be fuzzy and improve your knowledge, rather like leveling up in an RPG. Two questions motivated me to put this article together:
Question one:
The "Save As" dialog of Windows Notepad can convert a file between four encodings: GBK, Unicode, Unicode big endian, and UTF-8. They are all .txt files, so how does Windows tell which encoding a file uses?
I noticed long ago that text files saved as Unicode, Unicode big endian, and UTF-8 begin with a few extra bytes: FF FE (Unicode), FE FF (Unicode big endian), and EF BB BF (UTF-8). But what standard are these markers based on?
Question two:
I recently came across a file called ConvertUTF.c on the web that implements conversion among the three encodings UTF-32, UTF-16, and UTF-8. I already understood encodings such as Unicode (UCS-2), GBK, and UTF-8, but this program left me a little confused: I could not work out how UTF-16 relates to UCS-2.
After looking up the relevant material, I finally cleared up these questions and learned some details of Unicode along the way. I have written them up as an article for friends who have had similar questions. I have tried to keep the article easy to follow, but it does require the reader to know what a byte is and what hexadecimal is.
0, Big endian and little endian
Big endian and little endian are two different ways in which a CPU handles multi-byte numbers. For example, the Unicode code of the character "汉" is 6C49. When it is written to a file, should 6C be written first, or 49? Writing 6C first is big endian; writing 49 first is little endian.
The word "endian" comes from Gulliver's Travels. The civil wars of Lilliput arose from a dispute over whether an egg should be cracked at the big end (big-endian) or at the little end (little-endian). The dispute led to six rebellions, in which one emperor lost his life and another lost his throne.
In Chinese, "endian" is usually translated as "byte order" (字节序), and big endian and little endian are called "big tail" (大尾) and "little tail" (小尾).
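The two byte orders in this section can be seen directly with Python's `utf-16-be` and `utf-16-le` codecs. This is a small sketch using the character "汉" (6C49); it shows how the two bytes are laid out, not how any particular CPU stores them.

```python
han = "汉"                          # Unicode code 6C49

big = han.encode("utf-16-be")      # big endian: 6C is written first
little = han.encode("utf-16-le")   # little endian: 49 is written first

print(big.hex(" ").upper())
print(little.hex(" ").upper())
```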
1, Character encodings and internal codes, with a brief introduction to Chinese character encodings
Characters must be encoded before a computer can process them. The default encoding that a computer uses is its internal code. Early computers used 7-bit ASCII. To handle Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.
GB2312 (1980) contains 7445 characters in total: 6763 Chinese characters and 682 other symbols. In the internal code, the Chinese character area has high bytes from B0 to F7 and low bytes from A1 to FE, which gives 72*94=6768 code positions. Five of them, D7FA-D7FE, are unassigned.
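The arithmetic behind these figures can be checked with a few lines of Python. The sketch only counts code positions from the byte ranges quoted above; it does not consult any GB2312 mapping table, and the variable names are chosen for this example.

```python
high_bytes = range(0xB0, 0xF7 + 1)    # possible high bytes of a Chinese character
low_bytes = range(0xA1, 0xFE + 1)     # possible low bytes

positions = len(high_bytes) * len(low_bytes)   # size of the Chinese character area
vacant = 0xD7FE - 0xD7FA + 1                   # unassigned positions D7FA-D7FE
hanzi = positions - vacant                     # positions actually holding a character

print(len(high_bytes), len(low_bytes), positions, vacant, hanzi)
```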
GB2312 supports too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 contains 21886 symbols, divided into a Chinese character area and a graphic symbol area. The Chinese character area contains 21003 characters.
From ASCII through GB2312 to GBK, these encodings are backward compatible: the same character always has the same code in all of them, and each later standard supports more characters. In these encodings, English and Chinese can be handled uniformly. A Chinese character is recognized by the fact that the highest bit of its high byte is not 0. In programmers' terminology, GB2312 and GBK are double-byte character sets (DBCS).
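As a quick check of this compatibility, the sketch below encodes one ASCII letter and one Chinese character with several of Python's built-in codecs and collects the distinct results. It is an illustration on two sample characters, not a proof for the whole character set.

```python
gb_family = ["gb2312", "gbk", "gb18030"]

# An ASCII letter keeps its single-byte code in every encoding of the family.
ascii_codes = {"A".encode(name) for name in ["ascii"] + gb_family}

# A GB2312 Chinese character keeps its two-byte code in the later standards.
han_codes = {"汉".encode(name) for name in gb_family}

# The highest bit of the high byte marks the start of a Chinese character.
high_bit_set = bool("汉".encode("gbk")[0] & 0x80)

print(ascii_codes, han_codes, high_bit_set)
```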
GB18030, issued in 2000, is the official national standard that replaces GBK 1.0. It contains 27484 Chinese characters, together with the scripts of major minority languages such as Tibetan, Mongolian, and Uyghur. In terms of Chinese characters, GB18030 adds the 6582 characters of CJK Extension A (Unicode code points 0x3400-0x4DB5) to the 20902 Chinese characters of GB13000.1, for a total of 27484.
CJK stands for Chinese, Japanese, and Korean. To save code points, Unicode unifies the encoding of the characters used in these three languages. GB13000.1 is the Chinese edition of ISO/IEC 10646-1 and is equivalent to Unicode 1.1.
GB18030 uses single-byte, double-byte, and four-byte codes. The single-byte and double-byte codes are fully compatible with GBK. The four-byte codes cover the 6582 Chinese characters of CJK Extension A. For example, UCS 0x3400 is encoded in GB18030 as 8139EF30, and UCS 0x3401 as 8139EF31.
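The three code lengths can be observed with Python's `gb18030` codec. This is a sketch on three sample characters: an ASCII letter, a GBK Chinese character, and the first character of CJK Extension A. It prints the bytes that the codec produces so that they can be compared with the values quoted above.

```python
samples = ["A", "汉", "\u3400"]     # ASCII, GBK area, CJK Extension A

lengths = []
for ch in samples:
    encoded = ch.encode("gb18030")
    lengths.append(len(encoded))
    print(f"U+{ord(ch):04X} -> {encoded.hex().upper()} ({len(encoded)} bytes)")

# Decoding the bytes again should give back the original characters.
round_trip = [ch.encode("gb18030").decode("gb18030") for ch in samples]
```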
Microsoft provides a GB18030 upgrade package, but the package only supplies a new font that supports the 6582 Chinese characters of CJK Extension A, SimSun-18030 (新宋体-18030), and does not change the internal code. The internal code of Windows is still GBK.
A few details are worth noting here:
The original GB2312 standard still uses zone-position codes. To get from a zone-position code to the internal code, add A0 to both the high byte and the low byte.
For any character encoding, the order of the code units is fixed by the encoding scheme and has nothing to do with endianness. For example, the code unit of GBK is the byte, and a Chinese character is represented by two bytes. The order of these two bytes is fixed and is not affected by the CPU's byte order. The code unit of UTF-16 is the word (two bytes). The order of the words is fixed by the encoding scheme; only the arrangement of the bytes inside a word is affected by endianness. UTF-16 is covered later.
In GB2312, the highest bit of both bytes is 1. But only 128*128=16384 code positions satisfy this condition, so in GBK and GB18030 the highest bit of the low byte may not be 1. This does not affect the parsing of a DBCS character stream: when reading the stream, whenever a byte with its high bit set to 1 is encountered, that byte and the next one can be treated as one double-byte code, whatever the high bit of the low byte is.
2, Unicode, UCS and UTF
As mentioned above, the encodings from ASCII through GB2312 and GBK to GB18030 are backward compatible. Unicode, however, is compatible only with ASCII (more precisely, with ISO-8859-1) and is not compatible with the GB codes. For example, the Unicode code of "汉" is 6C49, while its GB code is BABA.
Unicode is also a character encoding method, but it was designed by international organizations and can accommodate the characters of all the world's languages. The character set is formally named "Universal Multiple-Octet Coded Character Set" (the title used by ISO/IEC 10646), abbreviated UCS. UCS can loosely be read as short for "Unicode Character Set".
According to Wikipedia (http://zh.wikipedia.org/wiki/): historically, two organizations tried to design a universal character set independently, namely the International Organization for Standardization (ISO) and an association of software manufacturers (unicode.org). ISO developed the ISO 10646 project, and the Unicode Consortium developed the Unicode project.
Around 1991, both sides realized that the world did not need two incompatible character sets. They began to merge their work and to cooperate on a single code table. Starting with Unicode 2.0, the Unicode project has used the same character repertoire and code points as ISO 10646-1.
Both projects still exist and publish their standards independently. At the time of writing, the latest version from the Unicode Consortium is Unicode 4.1.0 (2005), and the latest ISO standard is ISO 10646-3:2003.
UCS only specifies how characters are encoded; it does not specify how the codes are transmitted or stored. For example, the UCS code of "汉" is 6C49. I could transmit and store this code as four ASCII digits, or I could use UTF-8 and represent it with the three consecutive bytes E6 B1 89. What matters is that both communicating parties agree. UTF-8, UTF-7, and UTF-16 are the widely accepted schemes. A particular benefit of UTF-8 is that it is fully compatible with ASCII: the codes 00-7F are encoded as the same single bytes. (Characters in the 80-FF range of ISO-8859-1 take two bytes in UTF-8.) UTF stands for "UCS Transformation Format".
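The difference between a code and its transmitted form can be shown in a few lines of Python. Both forms below carry the same UCS code of "汉"; the "four ASCII digits" form is just the ad hoc scheme mentioned in the text, not a standard format.

```python
code = ord("汉")                                   # the UCS code, 6C49

as_ascii_digits = format(code, "04X").encode("ascii")   # four ASCII characters
as_utf8 = chr(code).encode("utf-8")                     # three UTF-8 bytes

print(as_ascii_digits, as_utf8.hex(" ").upper())

# The receiver recovers the same code from either form, as long as it
# knows which scheme the sender used.
from_digits = int(as_ascii_digits.decode("ascii"), 16)
from_utf8 = ord(as_utf8.decode("utf-8"))
```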
RFC 2781 and RFC 3629 from the IETF, in the usual RFC style, describe the UTF-16 and UTF-8 encoding methods clearly and vividly without sacrificing rigor. I can never remember that IETF stands for Internet Engineering Task Force, but the RFCs that the IETF maintains are the basis of all standards on the Internet.
2.1, Internal codes and code pages
The Windows kernel now supports the Unicode character set, so at the kernel level it can support the characters of all the world's languages. However, many existing programs and documents use an encoding specific to one language, such as GBK, so Windows cannot simply drop support for the existing encodings and switch entirely to Unicode.
Windows uses code pages to adapt to different countries and regions. A code page can be understood as the internal code mentioned earlier. The code page that corresponds to GBK is CP936.
Microsoft has also defined a code page for GB18030: CP54936. But because GB18030 includes some four-byte codes, while Windows code pages support only single-byte and double-byte encodings, this code page cannot really be used.
3, UCS-2, UCS-4, BMP
UCS comes in two forms: UCS-2 and UCS-4. As the names suggest, UCS-2 encodes with two bytes, and UCS-4 with four bytes (only 31 bits are actually used; the highest bit must be 0). Let us do some simple arithmetic:
UCS-2 has 2^16=65536 code positions, and UCS-4 has 2^31=2147483648 code positions.
UCS-4 is divided into 2^7=128 groups according to the highest byte, whose highest bit is 0. Each group is divided into 256 planes according to the second-highest byte. Each plane is divided into 256 rows according to the third byte, and each row contains 256 cells. The cells in the same row differ only in the last byte; everything else is the same.
Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP. In other words, the code positions in UCS-4 whose two high bytes are both 0 make up the BMP.
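Removing the two leading zero bytes from the BMP of UCS-4 gives UCS-2. Adding two zero bytes in front of the two bytes of UCS-2 gives the BMP of UCS-4. When this article was written, the author understood that no characters had yet been assigned outside the BMP in the UCS-4 standard; later versions of the standard do assign characters in other planes.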
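The group/plane/row/cell structure and the relationship between UCS-2 and the BMP of UCS-4 can be sketched like this. The function names are invented for the example, and the byte strings are written in big-endian order purely for readability.

```python
def ucs4_fields(code: int) -> tuple:
    """Split a UCS-4 code into (group, plane, row, cell), one byte each."""
    return tuple(code.to_bytes(4, "big"))

def ucs2_to_ucs4(ucs2: bytes) -> bytes:
    """Put two zero bytes in front of a UCS-2 code to get its UCS-4 form."""
    return b"\x00\x00" + ucs2

def ucs4_bmp_to_ucs2(ucs4: bytes) -> bytes:
    """Drop the two leading zero bytes of a BMP code to get UCS-2."""
    if ucs4[:2] != b"\x00\x00":
        raise ValueError("code is not in the BMP (group 0, plane 0)")
    return ucs4[2:]

print(ucs4_fields(0x6C49))                       # group, plane, row, cell of "汉"
print(ucs2_to_ucs4(b"\x6c\x49").hex(" ").upper())
```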
4, UTF encodings
UTF-8 encodes UCS in units of 8 bits. The mapping from UCS-2 to UTF-8 is as follows:
UCS-2 code (hexadecimal)    UTF-8 byte stream (binary)
0000 - 007F                 0xxxxxxx
0080 - 07FF                 110xxxxx 10xxxxxx
0800 - FFFF                 1110xxxx 10xxxxxx 10xxxxxx
For example, the Unicode code of "汉" is 6C49. 6C49 lies between 0800 and FFFF, so it must use the three-byte template: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 6C49 is 0110 110001 001001. Replacing the x's in the template with these bits in order gives 11100110 10110001 10001001, that is, E6 B1 89.
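The table and the worked example translate directly into a few bit operations. The following is a minimal sketch for UCS-2 codes only; the name `ucs2_to_utf8` is made up here, and the function does not reject the surrogate range D800-DFFF, which a production encoder would.

```python
def ucs2_to_utf8(code: int) -> bytes:
    """Encode one UCS-2 code (0000-FFFF) as UTF-8 using the three templates."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("not a UCS-2 code")
    if code <= 0x7F:                       # 0xxxxxxx
        return bytes([code])
    if code <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6),
                      0x80 | (code & 0x3F)])
    return bytes([0xE0 | (code >> 12),     # 1110xxxx 10xxxxxx 10xxxxxx
                  0x80 | ((code >> 6) & 0x3F),
                  0x80 | (code & 0x3F)])

print(ucs2_to_utf8(0x6C49).hex(" ").upper())   # the example from the text
```

Comparing the result with `chr(code).encode("utf-8")` is an easy way to check the sketch against Python's own encoder.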
Readers can use Notepad to check whether our encoding is correct. Note that UltraEdit automatically converts a UTF-8 text file to UTF-16 when opening it, which can cause confusion. You can turn this option off in the settings. A better tool is Hex Workshop.
UTF-16 encodes UCS in units of 16 bits. For UCS codes below 0x10000, the UTF-16 code is simply the 16-bit unsigned integer equal to the UCS code. For UCS codes of 0x10000 and above, an algorithm is defined. The codes of UCS-2, or equivalently of the BMP of UCS-4, are necessarily below 0x10000, so for these characters UTF-16 and UCS-2 can be regarded as essentially the same. However, UCS-2 is only an encoding scheme, whereas UTF-16 is used for actual transmission, so it has to deal with the question of byte order.
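The article does not spell out the algorithm for codes of 0x10000 and above. The sketch below follows the surrogate-pair scheme described in RFC 2781, which the article cites; the function name `utf16_code_units` is an assumption of this example.

```python
def utf16_code_units(code: int) -> list:
    """Return the 16-bit UTF-16 code units for one UCS code."""
    if code < 0x10000:
        return [code]                      # BMP: the code unit equals the code
    if code > 0x10FFFF:
        raise ValueError("cannot be represented in UTF-16")
    offset = code - 0x10000                # a 20-bit value
    high = 0xD800 + (offset >> 10)         # upper 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # lower 10 bits -> low surrogate
    return [high, low]

print([f"{u:04X}" for u in utf16_code_units(0x6C49)])
print([f"{u:04X}" for u in utf16_code_units(0x10000)])
```

For BMP characters the function returns the code unchanged, which is the sense in which UTF-16 and UCS-2 coincide.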
5, UTF byte order and the BOM
UTF-8 uses the byte as its code unit, so it has no byte-order problem. UTF-16 uses two bytes as its code unit, so before interpreting a UTF-16 text we must first determine the byte order of each code unit. For example, the Unicode code of "奎" is 594E and the Unicode code of "乙" is 4E59. If we receive the UTF-16 byte stream "594E", is it "奎" or "乙"?
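The ambiguity is easy to reproduce: the same two bytes decode to different characters depending on the byte order that the receiver assumes. A short sketch with Python's two explicit-order UTF-16 codecs:

```python
received = bytes.fromhex("594E")            # the two bytes on the wire

as_big = received.decode("utf-16-be")       # read as code 594E
as_little = received.decode("utf-16-le")    # read as code 4E59

print(as_big, as_little)
```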
The method that the Unicode standard recommends for marking byte order is the BOM. BOM here does not mean "Bill of Material" but Byte Order Mark. The BOM is a rather clever idea:
UCS contains a character called "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF. FFFE, on the other hand, is not a character in UCS and so should never appear in actual transmission. The UCS standard recommends that we transmit the character "ZERO WIDTH NO-BREAK SPACE" before transmitting the byte stream.
If the receiver then receives FEFF, the byte stream is big-endian; if it receives FFFE, the byte stream is little-endian. For this reason, the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.
UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this with the encoding method described earlier). So if the receiver gets a byte stream that begins with EF BB BF, it knows that the stream is UTF-8.
Windows uses the BOM in exactly this way to mark the encoding of text files.
6, Further references
The main source for this article is "Short Overview Of ISO-IEC 10646 And Unicode" (http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html).
I also found two documents that look good, but since my initial questions had already been answered, I did not read them:
"Understanding Unicode A General Introduction To The Unicode Standard" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter04a)
"Character Set Encoding Basics Understanding Character Set Encodings And Legacy Encodings" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter03)
I have written a software package for converting between UTF-8, UCS-2, and GBK, including versions that use the Windows API and versions that do not. If I have time later, I will tidy it up and put it on my personal homepage (http://fmddlmyy.home4u.china.com).
I started writing this article only after I had thought all the questions through, and I expected to finish it quickly. I did not expect that choosing the wording and checking the details would take so long: I ended up writing from 1:30 in the afternoon until 9:00. I hope some readers will benefit from it.
Appendix 1: More on zone-position codes, GB2312, internal codes, and code pages
Some friends still had doubts about this sentence in the article:
"The original GB2312 standard still uses zone-position codes. To get from a zone-position code to the internal code, add A0 to both the high byte and the low byte."
Let me explain in more detail:
"The original GB2312 standard" refers to a national standard from 1980, "National Standard of the People's Republic of China: Code of Chinese Graphic Character Set for Information Interchange, Primary Set, GB 2312-80". This standard encodes Chinese characters and Chinese symbols with two numbers. The first number is called the "zone" and the second is called the "position", so the code is also called the zone-position code. Zones 1-9 are Chinese symbols, zones 16-55 are level-1 Chinese characters, and zones 56-87 are level-2 Chinese characters. Windows still has a zone-position input method today; entering 1601, for example, produces "啊". (This input method automatically distinguishes hexadecimal GB2312 codes from decimal zone-position codes, which means that entering B0A1 also produces "啊".)
The internal code is the character encoding used inside the operating system. In early operating systems, the internal code depended on the language. Today's Windows supports Unicode internally and then uses code pages to adapt to the various languages, so the concept of an "internal code" has become rather blurred. Microsoft generally calls the encoding specified by the default code page the internal code.
There is no official definition of the term "internal code", and "code page" is simply a term used by Microsoft. As programmers, we only need to know what these things are; there is no need to research the terms in great depth.
A code page is a character encoding for the characters of a particular language. For example, the code page of GBK is CP936, the code page of Big5 is CP950, and the code page of GB2312 is CP20936.
Windows has the notion of a default code page, that is, the encoding used by default to interpret characters. Suppose, for example, that Windows Notepad opens a text file whose content is the byte stream BA, BA, D7, D6. How should Windows interpret it?
Should it interpret the bytes as Unicode, as GBK, as Big5, or as ISO8859-1? Interpreted as GBK, they give the two characters "汉字". Interpreted in another encoding, they may have no corresponding characters, or they may map to the wrong characters. "Wrong" here means different from what the author of the text intended; the result is garbled text.
The answer is that Windows interprets the byte stream in a text file according to the current default code page. The default code page can be set through Regional Options in Control Panel. The ANSI option in Notepad's "Save As" dialog actually saves the file in the encoding of the default code page.
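The effect of choosing a code page can be imitated in Python by decoding the same four bytes with different codecs. This is only a sketch: `gbk`, `latin-1`, and `big5` stand in for CP936, ISO8859-1, and CP950, and undecodable bytes are replaced rather than raising an error.

```python
raw = bytes.fromhex("BABAD7D6")       # the byte stream from the example

as_gbk = raw.decode("gbk")            # the author's intended reading
as_latin1 = raw.decode("latin-1")     # every byte becomes one Western character
as_big5 = raw.decode("big5", errors="replace")   # some other characters, or U+FFFD

print(as_gbk, as_latin1, as_big5)
```

Only the GBK reading matches what the author wrote; the others are what the text calls garbled.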
The internal code of Windows is Unicode, so technically it can support several code pages at the same time. As long as a file states which encoding it uses and the user has installed the corresponding code page, Windows can display it correctly. An HTML file, for example, can specify its charset.
Some authors of HTML files, especially English-speaking authors, assume that everyone in the world uses English and do not specify a charset in the file. If such an author uses characters between 0x80 and 0xFF, Chinese Windows interprets them according to its default, GBK, and garbled text appears. In that case it is enough to add a statement that specifies the charset to the HTML file, for example:
<meta http-equiv="Content-Type" content="text/html; charset=ISO8859-1">
If the code page that the original author used is compatible with ISO8859-1, the garbled text will no longer appear.
To return to zone-position codes: the zone-position code of "啊" is 1601, which written in hexadecimal is 0x10, 0x01. This conflicts with the ASCII codes that computers use widely. To remain compatible with the ASCII range 00-7F, we add A0 to both the high byte and the low byte of the zone-position code. The code of "啊" thus becomes B0A1. The code with the two A0s added is also called the GB2312 code, although the original GB2312 standard does not mention this at all.
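The conversion from a zone-position code to the internal code is a one-line calculation. In this sketch the function name `zone_to_internal` is made up, and the zone-position code is taken as a four-digit decimal number such as 1601.

```python
def zone_to_internal(zone_code: int) -> bytes:
    """Convert a decimal zone-position code (e.g. 1601) to its GB2312 bytes."""
    zone, position = divmod(zone_code, 100)    # 1601 -> zone 16, position 1
    if not (1 <= zone <= 94 and 1 <= position <= 94):
        raise ValueError("zone and position must both be in 1-94")
    return bytes([zone + 0xA0, position + 0xA0])   # add A0 to each byte

internal = zone_to_internal(1601)
print(internal.hex().upper(), internal.decode("gb2312"))
```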
...