# NAME

Lingua::JA::NormalizeText - All-in-One Japanese text normalizer

# SYNOPSIS

    use Lingua::JA::NormalizeText;
    use utf8;

    my @options = ( qw/nfkc decode_entities/, \&dearinsu_to_desu );
    my $normalizer = Lingua::JA::NormalizeText->new(@options);

    my $text = $normalizer->normalize('���������������������������&hearts;'); # => '���������������������������'

    sub dearinsu_to_desu
    {
        my $text = shift;
        $text =~ s/でありんす/です/g;

        return $text;
    }

\# or

    use Lingua::JA::NormalizeText qw/old2new_kanji/;
    use utf8;

    my $text = old2new_kanji('���������'); # => '���������'

# DESCRIPTION

This module provides a wide range of normalization options for Japanese text.
These options simplify the pre-processing of Japanese text.

# METHODS

## new(@options)

Creates a new Lingua::JA::NormalizeText instance.

The following options are available:

    OPTION                 SAMPLE INPUT           OUTPUT FOR SAMPLE INPUT
    ---------------------  ---------------------  -----------------------
    lc                     DdD                    ddd
    uc                     DdD                    DDD
    nfkc                   ｶﾞ                     ガ (U+30AC)
    nfkd                   ｶﾞ                     ガ (U+30AB, U+3099)
    nfc                    ド                     ド (U+30C9)
    nfd                    ド                     ド (U+30C8, U+3099)
    decode_entities        &hearts;               ♥
    strip_html             <em>���</em>            ���
    alnum_z2h              ＡＢＣ１２３           ABC123
    alnum_h2z              ABC123                 ＡＢＣ１２３
    space_z2h              \x{3000}               \x{0020}
    space_h2z              \x{0020}               \x{3000}
    katakana_z2h           ������������               ������������
    katakana_h2z           ������������������������               ������������������������
    katakana2hiragana      ���������                 ���������
    hiragana2katakana      ���������                 ���������
    wave2tilde             〜, 〰                 ～
    tilde2wave             ～                     〜
    wavetilde2long         〜, 〰, ～             ー
    wave2long              〜, 〰                 ー
    tilde2long             ～                     ー
    fullminus2long         －                     ー
    dashes2long            ���                     ���
    drawing_lines2long     ���                     ���
    unify_long_repeats     ���������������             ���������
    nl2space               (LF)(CR)(CRLF)         (space)(space)(space)
    unify_nl               (LF)(CR)(CRLF)         \n\n\n
    unify_long_spaces      ���(space)(space)���     ���(space)���
    unify_whitespaces      \x{00A0}               (space)
    trim                   (space)���(space)���(space)  ���(space)���
    ltrim                  (space)���(space)       ���(space)
    rtrim                  ������(space)(space)     ������
    old2new_kana           ������������������           ������������������������
    old2new_kanji          ���������                 ���������
    tab2space              (tab)(tab)             (space)(space)
    remove_controls        ���\x{0000}���           ������
    remove_DFC             \x{202E}HOGE           HOGE
    remove_spaces          \x{0020}���\x{3000}���\x{0020}  ������
    dakuon_normalize       さ\x{3099}             ざ (U+3056)
    handakuon_normalize    は\x{309A}             ぱ (U+3071)
    all_dakuon_normalize   さ\x{3099}は\x{309A}   ざぱ (U+3056, U+3071)
    square2katakana        ���                     ���������
    circled2kana           ���������������             ���������������
    circled2kanji          ���������������             ���������������
    decompose_parenthesized_kanji  ���             (���)

Options are applied in the order in which they appear in @options:
the first element is applied first, and the last element is applied last.

You can also pass references to your own functions as options.
(See the dearinsu\_to\_desu function in the SYNOPSIS section.)
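
The ordering rule matters whenever two steps touch the same characters. The sketch below is not the module's implementation: `apply_in_order` and `abc_to_x` are hypothetical names used only to illustrate how a list of steps applied first to last can give different results when the order changes.

```perl
use strict;
use warnings;

# Hypothetical helper: apply each code reference to the text, first to last.
sub apply_in_order {
    my ($text, @steps) = @_;
    $text = $_->($text) for @steps;
    return $text;
}

# Hypothetical custom function, in the spirit of dearinsu_to_desu.
sub abc_to_x {
    my $text = shift;
    $text =~ s/abc/X/g;
    return $text;
}

my $lower = sub { return lc $_[0] };

# lc runs first, so abc_to_x sees 'abc' and replaces it.
my $lc_first = apply_in_order('ABC', $lower, \&abc_to_x);

# abc_to_x runs first and finds nothing to replace; lc then lowercases.
my $lc_second = apply_in_order('ABC', \&abc_to_x, $lower);
```

Here `$lc_first` is `'X'` and `$lc_second` is `'abc'`, so a custom function usually belongs after the options that produce the text it expects.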

## normalize($text)

Normalizes $text by applying the options passed to new() and returns the result.

# OPTIONS

## lc, uc

These options are the same as CORE::lc and CORE::uc.

## nfkc, nfkd, nfc, nfd

See [Unicode::Normalize](https://metacpan.org/pod/Unicode::Normalize).
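
As a rough illustration of the four forms, the following sketch calls Unicode::Normalize directly on the code points shown in the options table. It assumes the options behave like the functions of the same name.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKC NFKD NFC NFD);

# HALFWIDTH KATAKANA LETTER KA followed by the halfwidth voiced sound mark.
my $halfwidth_ga = "\x{FF76}\x{FF9E}";

my $nfkc = NFKC($halfwidth_ga);   # compatibility composition:   U+30AC
my $nfkd = NFKD($halfwidth_ga);   # compatibility decomposition: U+30AB, U+3099

# KATAKANA LETTER DO in precomposed form.
my $do  = "\x{30C9}";
my $nfd = NFD($do);               # canonical decomposition: U+30C8, U+3099
my $nfc = NFC($nfd);              # canonical composition:   U+30C9
```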

## decode\_entities

See [HTML::Entities](https://metacpan.org/pod/HTML::Entities).

## strip\_html

Strips HTML tags.

## alnum\_z2h, alnum\_h2z

Converts alphabetic characters, digits, and symbols between ZENKAKU (full-width) and HANKAKU (half-width).

ZENKAKU:

    ！＂＃＄％＆＇（）＊＋，－．／０１２３４５６７８９：；＜＝＞
    ？＠ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ［＼
    ］＾＿｀ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ
    ���������������������������������������

HANKAKU:

    !"#$%&'()*+,-./0123456789:;<=>
    ?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\
    ]^_`abcdefghijklmnopqrstuvwxyz
    {|}~���������������������

## space\_z2h, space\_h2z

Converts between SPACE (U+0020) and IDEOGRAPHIC SPACE (U+3000).

## katakana\_z2h, katakana\_h2z

Converts katakana between ZENKAKU (full-width) and HANKAKU (half-width).

See [Lingua::JA::Regular::Unicode](https://metacpan.org/pod/Lingua::JA::Regular::Unicode).

## hiragana2katakana

INPUT:

    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������

OUTPUT FOR INPUT:

    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������

## katakana2hiragana

INPUT:

    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������
    ���������������������������������������������������������������������������������������������������������������������������������������������������������������������

OUTPUT FOR INPUT:

    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ���������������������������������������������������������������������������

## wave2tilde

Converts WAVE DASH (U+301C) and WAVY DASH (U+3030) into tilde (U+FF5E).

## tilde2wave

Converts tilde (U+FF5E) into WAVE DASH (U+301C).

## wavetilde2long

Converts WAVE DASH (U+301C), WAVY DASH (U+3030) and tilde (U+FF5E) into long (U+30FC, KATAKANA-HIRAGANA PROLONGED SOUND MARK).

## wave2long

Converts WAVE DASH (U+301C) and WAVY DASH (U+3030) into long (U+30FC).

## tilde2long

Converts tilde (U+FF5E) into long (U+30FC).

## fullminus2long

Converts FULLWIDTH HYPHEN-MINUS (U+FF0D) into long (U+30FC).

## dashes2long

Converts the following characters into long (U+30FC).

    U+2012  FIGURE DASH
    U+2013  EN DASH
    U+2014  EM DASH
    U+2015  HORIZONTAL BAR

Note that this option does not convert hyphens into long.
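
A minimal sketch of this mapping, written with `tr///` over the four code points listed above. The function name `dashes_to_long_sketch` is hypothetical, and the module's own implementation may differ.

```perl
use strict;
use warnings;

# Sketch only: map U+2012..U+2015 to U+30FC. Hyphens are outside the range.
sub dashes_to_long_sketch {
    my $text = shift;
    # tr/// repeats the last replacement character for the rest of the range.
    $text =~ tr/\x{2012}-\x{2015}/\x{30FC}/;
    return $text;
}

# EM DASH and HORIZONTAL BAR between katakana are converted.
my $converted = dashes_to_long_sketch("\x{30B3}\x{2014}\x{30D2}\x{2015}");

# HYPHEN-MINUS (U+002D) and HYPHEN (U+2010) are left alone.
my $untouched = dashes_to_long_sketch("a-b\x{2010}c");
```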

## drawing\_lines2long

Converts the following characters into long (U+30FC).

    U+2500  BOX DRAWINGS LIGHT HORIZONTAL
    U+2501  BOX DRAWINGS HEAVY HORIZONTAL
    U+254C  BOX DRAWINGS LIGHT DOUBLE DASH HORIZONTAL
    U+254D  BOX DRAWINGS HEAVY DOUBLE DASH HORIZONTAL
    U+2574  BOX DRAWINGS LIGHT LEFT
    U+2576  BOX DRAWINGS LIGHT RIGHT
    U+2578  BOX DRAWINGS HEAVY LEFT
    U+257A  BOX DRAWINGS HEAVY RIGHT

## unify\_long\_repeats

Collapses consecutive repeats of long (U+30FC) into a single long.
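
The sketch below assumes that a run of two or more U+30FC characters is reduced to one, as the sample in the options table suggests. The function name `unify_long_repeats_sketch` is hypothetical, and this is not the module's own code.

```perl
use strict;
use warnings;

# Sketch only: squeeze any run of U+30FC down to a single U+30FC.
sub unify_long_repeats_sketch {
    my $text = shift;
    $text =~ s/\x{30FC}{2,}/\x{30FC}/g;
    return $text;
}

# Two katakana followed by three long marks.
my $unified = unify_long_repeats_sketch("\x{30F4}\x{30A1}\x{30FC}\x{30FC}\x{30FC}");
```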

## nl2space

Converts new lines (LF, CR, CRLF) into SPACE (U+0020).

## unify\_nl

Converts each new line (LF, CR, CRLF) into \n.
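
One way to express this is a single substitution that tries CRLF before a lone CR or LF. This is a sketch based on the sample in the options table, with the hypothetical name `unify_nl_sketch`, and not necessarily how the module does it.

```perl
use strict;
use warnings;

# Sketch only: turn every LF, CR, or CRLF into "\n".
sub unify_nl_sketch {
    my $text = shift;
    # CRLF is listed first so the pair becomes one "\n" rather than two.
    $text =~ s/\x0D\x0A|\x0D|\x0A/\n/g;
    return $text;
}

my $unified = unify_nl_sketch("a\x0Ab\x0Dc\x0D\x0Ad");
```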

## unify\_long\_spaces

Collapses runs of repeated spaces (U+0020 and U+3000).

## unify\_whitespaces

Converts the following characters into SPACE (U+0020).

    U+000B  LINE TABULATION
    U+000C  FORM FEED
    U+0085  NEXT LINE
    U+00A0  NO-BREAK SPACE
    U+1680  OGHAM SPACE MARK
    U+2000  EN QUAD
    U+2001  EM QUAD
    U+2002  EN SPACE
    U+2003  EM SPACE
    U+2004  THREE-PER-EM SPACE
    U+2005  FOUR-PER-EM SPACE
    U+2006  SIX-PER-EM SPACE
    U+2007  FIGURE SPACE
    U+2008  PUNCTUATION SPACE
    U+2009  THIN SPACE
    U+200A  HAIR SPACE
    U+2028  LINE SEPARATOR
    U+2029  PARAGRAPH SEPARATOR
    U+202F  NARROW NO-BREAK SPACE
    U+205F  MEDIUM MATHEMATICAL SPACE

Note that this option does not convert the following characters:

    U+0009  CHARACTER TABULATION
    U+000A  LINE FEED
    U+000D  CARRIAGE RETURN
    U+3000  IDEOGRAPHIC SPACE
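
The two lists above can be sketched as a single `tr///` over the converted code points. The function name `unify_whitespaces_sketch` is hypothetical, and the module's implementation may differ.

```perl
use strict;
use warnings;

# Sketch only: map the listed whitespace characters to SPACE (U+0020).
# Tab, LF, CR, and IDEOGRAPHIC SPACE are deliberately not in the search list.
sub unify_whitespaces_sketch {
    my $text = shift;
    $text =~ tr/\x{000B}\x{000C}\x{0085}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}/\x{0020}/;
    return $text;
}

# NO-BREAK SPACE, EM SPACE, and MEDIUM MATHEMATICAL SPACE become U+0020.
my $converted = unify_whitespaces_sketch("a\x{00A0}b\x{2003}c\x{205F}d");

# Tab, LF, CR, and IDEOGRAPHIC SPACE pass through unchanged.
my $untouched = unify_whitespaces_sketch("a\tb\nc\rd\x{3000}e");
```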

## trim

Removes leading and trailing whitespace.

## ltrim

Removes only leading whitespace.

## rtrim

Removes only trailing whitespace.

## old2new\_kana

    INPUT  OUTPUT FOR INPUT
    -----  --------------------
    ���     ���
    ���     ���
    ���     ���
    ���     ���
    ���     ������ (U+30A4, U+3099)
    ���     ������ (U+30A8, U+3099)

## old2new\_kanji

INPUT:

    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ���������������������������������������������������������������������������

OUTPUT FOR INPUT:

    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ������������������������������������������������������������������������������������������
    ���������������������������������������������������������������������������

## tab2space

Converts CHARACTER TABULATION (U+0009) into SPACE (U+0020).

## remove\_controls

Removes the following control characters:

    U+0000 .. U+0008
    U+000B
    U+000C
    U+000E .. U+001F
    U+007F .. U+009F

Note that this option does not remove the following characters:

    U+0009  CHARACTER TABULATION
    U+000A  LINE FEED
    U+000D  CARRIAGE RETURN
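
As a sketch, the ranges above translate directly into a deleting `tr///d`. The function name `remove_controls_sketch` is hypothetical, and this is an illustration rather than the module's own code.

```perl
use strict;
use warnings;

# Sketch only: delete the listed control characters.
# U+0009, U+000A, and U+000D fall in the gaps between ranges and are kept.
sub remove_controls_sketch {
    my $text = shift;
    $text =~ tr/\x{0000}-\x{0008}\x{000B}\x{000C}\x{000E}-\x{001F}\x{007F}-\x{009F}//d;
    return $text;
}

# NULL between two hiragana, then DELETE and NEXT LINE around a letter.
my $cleaned = remove_controls_sketch("\x{3042}\x{0000}\x{3044}\x{007F}x\x{0085}");

# Tab, LF, and CR survive.
my $kept = remove_controls_sketch("a\tb\nc\rd");
```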

## remove\_DFC

Removes the following Directional Formatting Characters:

    U+061C  ARABIC LETTER MARK
    U+2066  LEFT-TO-RIGHT ISOLATE
    U+2067  RIGHT-TO-LEFT ISOLATE
    U+2068  FIRST STRONG ISOLATE
    U+2069  POP DIRECTIONAL ISOLATE
    U+200E  LEFT-TO-RIGHT MARK
    U+200F  RIGHT-TO-LEFT MARK
    U+202A  LEFT-TO-RIGHT EMBEDDING
    U+202B  RIGHT-TO-LEFT EMBEDDING
    U+202C  POP DIRECTIONAL FORMATTING
    U+202D  LEFT-TO-RIGHT OVERRIDE
    U+202E  RIGHT-TO-LEFT OVERRIDE

See [http://www.unicode.org/reports/tr9/](http://www.unicode.org/reports/tr9/) for more information about Directional Formatting Characters.
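
The list above can be sketched as one deleting `tr///d`, with the contiguous code points written as ranges. The function name `remove_dfc_sketch` is hypothetical, and the module's implementation may differ.

```perl
use strict;
use warnings;

# Sketch only: delete the Directional Formatting Characters listed above.
sub remove_dfc_sketch {
    my $text = shift;
    $text =~ tr/\x{061C}\x{200E}\x{200F}\x{202A}-\x{202E}\x{2066}-\x{2069}//d;
    return $text;
}

# RIGHT-TO-LEFT OVERRIDE before ASCII text, as in the options table.
my $cleaned = remove_dfc_sketch("\x{202E}HOGE");

# Isolates and marks in the middle of a string are removed as well.
my $cleaned_mixed = remove_dfc_sketch("a\x{2066}b\x{2069}c\x{200F}");
```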

## remove\_spaces

Removes SPACE (U+0020) and IDEOGRAPHIC SPACE (U+3000).

## dakuon\_normalize, handakuon\_normalize, all\_dakuon\_normalize

See [Lingua::JA::Dakuon](https://metacpan.org/pod/Lingua::JA::Dakuon).

Note that Lingua::JA::NormalizeText enables the $Lingua::JA::Dakuon::EnableCombining flag.

## square2katakana, circled2kana, circled2kanji

See [Lingua::JA::Moji](https://metacpan.org/pod/Lingua::JA::Moji).

## decompose\_parenthesized\_kanji

Decomposes the following parenthesized kanji:

    ������������������������������������������������������������������������������������������������������������

# AUTHOR

pawa <pawapawa@cpan.org>

# SEE ALSO

[Old character forms reference](http://www.asahi-net.or.jp/~ax2s-kmtn/ref/old_chara.html)

[康熙字典 (Kangxi Dictionary)](http://ja.wikipedia.org/wiki/%E5%BA%B7%E7%86%99%E5%AD%97%E5%85%B8)

[Lingua::JA::Regular::Unicode](https://metacpan.org/pod/Lingua::JA::Regular::Unicode)

[Lingua::JA::Dakuon](https://metacpan.org/pod/Lingua::JA::Dakuon)

[Lingua::JA::Moji](https://metacpan.org/pod/Lingua::JA::Moji)

[Unicode::Normalize](https://metacpan.org/pod/Unicode::Normalize)

[Unicode::Number](https://metacpan.org/pod/Unicode::Number)

[HTML::Entities](https://metacpan.org/pod/HTML::Entities)

[HTML::Scrubber](https://metacpan.org/pod/HTML::Scrubber)

# LICENSE

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.