Converting All Types of Smart Quotes with PHP
The following code can be used. It expects UTF-8 input.
$chr_map = array(
    // Windows codepage 1252
    "\xC2\x82" => "'", // U+0082⇒U+201A single low-9 quotation mark
    "\xC2\x84" => '"', // U+0084⇒U+201E double low-9 quotation mark
    "\xC2\x8B" => "'", // U+008B⇒U+2039 single left-pointing angle quotation mark
    "\xC2\x91" => "'", // U+0091⇒U+2018 left single quotation mark
    "\xC2\x92" => "'", // U+0092⇒U+2019 right single quotation mark
    "\xC2\x93" => '"', // U+0093⇒U+201C left double quotation mark
    "\xC2\x94" => '"', // U+0094⇒U+201D right double quotation mark
    "\xC2\x9B" => "'", // U+009B⇒U+203A single right-pointing angle quotation mark

    // Regular Unicode
    // U+0022 quotation mark (")
    // U+0027 apostrophe (')
    "\xC2\xAB" => '"', // U+00AB left-pointing double angle quotation mark
    "\xC2\xBB" => '"', // U+00BB right-pointing double angle quotation mark
    "\xE2\x80\x98" => "'", // U+2018 left single quotation mark
    "\xE2\x80\x99" => "'", // U+2019 right single quotation mark
    "\xE2\x80\x9A" => "'", // U+201A single low-9 quotation mark
    "\xE2\x80\x9B" => "'", // U+201B single high-reversed-9 quotation mark
    "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
    "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
    "\xE2\x80\x9E" => '"', // U+201E double low-9 quotation mark
    "\xE2\x80\x9F" => '"', // U+201F double high-reversed-9 quotation mark
    "\xE2\x80\xB9" => "'", // U+2039 single left-pointing angle quotation mark
    "\xE2\x80\xBA" => "'", // U+203A single right-pointing angle quotation mark
);
$char_val = array_keys($chr_map);   // but: for efficiency you should
$rpl      = array_values($chr_map); // pre-calculate these two arrays
$str = str_replace($char_val, $rpl, html_entity_decode($str, ENT_QUOTES, "UTF-8"));
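As a rough illustration of how the snippet above might be packaged for reuse, the sketch below wraps the same approach in a function and caches the key and value arrays in static variables, as the comments in the original code suggest. The function name `convert_smart_quotes` is hypothetical, and the map is deliberately shortened to a few entries; a real version would carry the full map.

```php
<?php
// Hypothetical helper; the name and the shortened map are assumptions for illustration.
function convert_smart_quotes(string $str): string
{
    // Computed once per request and reused on later calls.
    static $search = null, $replace = null;

    if ($search === null) {
        $map = [
            "\xE2\x80\x98" => "'", // U+2018 left single quotation mark
            "\xE2\x80\x99" => "'", // U+2019 right single quotation mark
            "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
            "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
            "\xC2\xAB"     => '"', // U+00AB left-pointing double angle quotation mark
            "\xC2\xBB"     => '"', // U+00BB right-pointing double angle quotation mark
        ];
        $search  = array_keys($map);
        $replace = array_values($map);
    }

    // Decode entities such as &ldquo; first so they are caught by the map as well.
    return str_replace($search, $replace, html_entity_decode($str, ENT_QUOTES, 'UTF-8'));
}

echo convert_smart_quotes("\xE2\x80\x9CIt\xE2\x80\x99s fine,\xE2\x80\x9D she said."), "\n";
// prints: "It's fine," she said.
```

Note that `html_entity_decode()` also decodes every other entity in the input (for example `&amp;`), so this is only appropriate when the result is not going to be emitted as HTML without re-escaping.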
Explanation
Every Unicode character belongs to a "general category".
Of these general categories, the ones that can contain quotation-mark characters are:
Ps: "Punctuation, Open"
Pe: "Punctuation, Close"
Pi: "Punctuation, Initial quote (might behave like Ps or Pe depending on its usage)"
Pf: "Punctuation, Final quote (might behave like Ps or Pe depending on its usage)"
Po: "Punctuation, Other"
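The categories listed above can be queried from PHP through PCRE's Unicode property escapes. The sketch below is only an illustration (the function name `find_quote_like_chars` is hypothetical): it collects the distinct characters in a string that fall into Pi or Pf. It assumes valid UTF-8 input. Pi and Pf alone do not cover everything, since the low-9 quotes U+201A and U+201E are classified as Ps and the ASCII quotes as Po, and those categories also hold brackets and other punctuation, which is one reason an explicit map is used for the actual replacement.

```php
<?php
// Hypothetical helper: lists distinct characters whose general category is Pi or Pf.
function find_quote_like_chars(string $str): array
{
    // \p{Pi} = initial quote punctuation, \p{Pf} = final quote punctuation.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    preg_match_all('/[\p{Pi}\p{Pf}]/u', $str, $matches);

    return array_values(array_unique($matches[0]));
}

$found = find_quote_like_chars("\xE2\x80\x9CHi\xE2\x80\x9D and \xC2\xABoui\xC2\xBB (ok)");
foreach ($found as $char) {
    echo bin2hex($char), "\n"; // hex dump of each matched character's UTF-8 bytes
}
```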
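If you are not sure whether the input is UTF-8 encoded, place the following code before any other operation:
if (!preg_match('/^\X*$/u', $str)) {
    // utf8_encode() assumes the input is ISO-8859-1; it is deprecated as of PHP 8.2
    $str = utf8_encode($str);
}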
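Because `utf8_encode()` is deprecated in recent PHP versions, the same check can be sketched without it. The version below is an assumption-laden illustration, not a drop-in from the original: the function name `ensure_utf8` is hypothetical, it assumes that any input that is not valid UTF-8 is ISO-8859-1 (the same assumption `utf8_encode()` makes), and it uses `mb_convert_encoding()` when the mbstring extension is loaded, with a manual byte-level fallback otherwise.

```php
<?php
// Hypothetical helper; treats any non-UTF-8 input as ISO-8859-1 (an assumption).
function ensure_utf8(string $str): string
{
    // With the /u modifier, preg_match() returns false if the subject is not valid UTF-8.
    if (preg_match('//u', $str) === 1) {
        return $str;
    }

    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
    }

    // Fallback without mbstring: each byte 0x80-0xFF becomes a two-byte UTF-8 sequence.
    return preg_replace_callback('/[\x80-\xFF]/', function (array $m): string {
        $byte = ord($m[0]);
        return chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
    }, $str);
}

echo bin2hex(ensure_utf8("caf\xE9")), "\n"; // prints: 636166c3a9
```

Input that is already valid UTF-8 is returned unchanged, so calling the function twice on the same string does not double-encode it.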
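If you need to normalize characters from the 0x80-0x9F range (Windows codepage 1252) to their proper Unicode code points, the following code can be used: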
$normalization_map = array(
    "\xC2\x80" => "\xE2\x82\xAC", // U+20AC Euro sign
    "\xC2\x82" => "\xE2\x80\x9A", // U+201A single low-9 quotation mark
    "\xC2\x83" => "\xC6\x92",     // U+0192 latin small letter f with hook
    "\xC2\x84" => "\xE2\x80\x9E", // U+201E double low-9 quotation mark
    "\xC2\x85" => "\xE2\x80\xA6", // U+2026 horizontal ellipsis
    "\xC2\x86" => "\xE2\x80\xA0", // U+2020 dagger
    "\xC2\x87" => "\xE2\x80\xA1", // U+2021 double dagger
    "\xC2\x88" => "\xCB\x86",     // U+02C6 modifier letter circumflex accent
    "\xC2\x89" => "\xE2\x80\xB0", // U+2030 per mille sign
    "\xC2\x8A" => "\xC5\xA0",     // U+0160 latin capital letter s with caron
    "\xC2\x8B" => "\xE2\x80\xB9", // U+2039 single left-pointing angle quotation mark
    "\xC2\x8C" => "\xC5\x92",     // U+0152 latin capital ligature oe
    "\xC2\x8E" => "\xC5\xBD",     // U+017D latin capital letter z with caron
    "\xC2\x91" => "\xE2\x80\x98", // U+2018 left single quotation mark
    "\xC2\x92" => "\xE2\x80\x99", // U+2019 right single quotation mark
    "\xC2\x93" => "\xE2\x80\x9C", // U+201C left double quotation mark
    "\xC2\x94" => "\xE2\x80\x9D", // U+201D right double quotation mark
    "\xC2\x95" => "\xE2\x80\xA2", // U+2022 bullet
    "\xC2\x96" => "\xE2\x80\x93", // U+2013 en dash
    "\xC2\x97" => "\xE2\x80\x94", // U+2014 em dash
    "\xC2\x98" => "\xCB\x9C",     // U+02DC small tilde
    "\xC2\x99" => "\xE2\x84\xA2", // U+2122 trade mark sign
    "\xC2\x9A" => "\xC5\xA1",     // U+0161 latin small letter s with caron
    "\xC2\x9B" => "\xE2\x80\xBA", // U+203A single right-pointing angle quotation mark
    "\xC2\x9C" => "\xC5\x93",     // U+0153 latin small ligature oe
    "\xC2\x9E" => "\xC5\xBE",     // U+017E latin small letter z with caron
    "\xC2\x9F" => "\xC5\xB8",     // U+0178 latin capital letter y with diaeresis
);
$chr = array_keys($normalization_map);   // but: for efficiency you should
$rpl = array_values($normalization_map); // pre-calculate these two arrays
$str = str_replace($chr, $rpl, $str);
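To make the effect of this map concrete, here is a small sketch using a hypothetical four-entry subset of it. The input simulates Windows-1252 bytes (`\x93Hi\x94`) that were converted to UTF-8 as if they were ISO-8859-1, which leaves C1 control characters where the quotes should be. The sketch uses `strtr()` with the map directly instead of `str_replace()` with separate key and value arrays; that is a swapped-in alternative and not what the original code does.

```php
<?php
// Hypothetical subset of the normalization map, for illustration only.
$subset = [
    "\xC2\x91" => "\xE2\x80\x98", // U+2018 left single quotation mark
    "\xC2\x92" => "\xE2\x80\x99", // U+2019 right single quotation mark
    "\xC2\x93" => "\xE2\x80\x9C", // U+201C left double quotation mark
    "\xC2\x94" => "\xE2\x80\x9D", // U+201D right double quotation mark
];

// U+0093 "Hi" U+0094: what remains after a Latin-1 based conversion of CP1252 quotes.
$mangled = "\xC2\x93Hi\xC2\x94";

// strtr() accepts the associative map as-is and replaces each key with its value.
$fixed = strtr($mangled, $subset);

echo bin2hex($fixed), "\n"; // prints: e2809c4869e2809d
```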