ShibataProjects history (No.8)

Git repos
Google Drive
Japanese files (translated for e-learning)
- Source files
- Line count check
English files

Git repos

Google Drive

drive-20210719.zip

Japanese files (translated for e-learning)

Source files

LFS201-text-master.zip

Line count check

The original files are UTF-8 with CRLF (Windows) line endings

[local] munakata:~/latex/LFS201/JP_orig$ file Chapter9.txt 
Chapter9.txt: UTF-8 Unicode text, with CRLF line terminators

Batch conversion from CRLF to LF

[local] munakata:~/latex/LFS201/JP_orig$ find . -name '*.txt' | xargs file | grep CRLF | awk -F: '{print $1}' | xargs nkf -Lu --overwrite
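
Where nkf is not available, the same conversion can be sketched in Python. This is only a minimal sketch, not part of the original workflow: the function names `crlf_to_lf` and `convert_tree` are hypothetical, and the code works on raw bytes so the UTF-8 content is left untouched.

```python
from pathlib import Path


def crlf_to_lf(data: bytes) -> bytes:
    """Replace every CRLF pair with a single LF; lone LF bytes are left alone."""
    return data.replace(b"\r\n", b"\n")


def convert_tree(root):
    """Rewrite in place every *.txt under root that contains CRLF.

    Returns the list of paths that were changed.
    """
    changed = []
    for path in sorted(Path(root).rglob("*.txt")):
        raw = path.read_bytes()
        if b"\r\n" in raw:
            path.write_bytes(crlf_to_lf(raw))
            changed.append(str(path))
    return changed
```

Calling `convert_tree(".")` from `JP_orig` would touch the same set of files as the find/xargs pipeline above, assuming all text files use the `.txt` extension.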

Dump of the files with blank lines removed (excerpt)

Chapter9.txt:クイズ開始
Chapter9.txt:問題 9.1
Chapter9.txt:zypper install <package>は、新しいパッケージをインストールするために使います。True or False?
Chapter9.txt:A. True
Chapter9.txt:B. False
Chapter9.txt:問題 9.2
Chapter9.txt:zypper updateは、パッケージを引数として指定できません。True or False?
Chapter9.txt:A. True
Chapter9.txt:B. False

Count of non-blank lines → 3,133 lines

[local] munakata:~/latex/LFS201/JP_orig$ grep -v '^\s*$' *.txt | wc -l
3133
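
The count can be cross-checked in Python. The following is a minimal sketch with a hypothetical function name, `count_nonblank_lines`. One caveat: `str.strip()` also treats the ideographic space (U+3000) as whitespace, so the result could differ from grep depending on the locale.

```python
def count_nonblank_lines(paths):
    """Count lines that contain at least one non-whitespace character."""
    total = 0
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            total += sum(1 for line in fh if line.strip())
    return total
```

For example, `count_nonblank_lines(sorted(glob.glob("*.txt")))` (after `import glob`) covers the same files as the shell command.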

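Create the conversion source data file
- Conversion source files (merge files, prepend the file name, remove blank lines) ----> JP_orig.txt
- The format is [file name (without extension)]:[index number]:Japanese text
- Conversion program (add_index.py)
  - add_index.py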

Conversion sample

Chapter9:000100:クイズ開始
Chapter9:000101:問題 9.1
Chapter9:000102:zypper install <package>は、新しいパッケージをインストールするために使います。True or False?
Chapter9:000103:A. True
Chapter9:000104:B. False
Chapter9:000105:問題 9.2
Chapter9:000106:zypper updateは、パッケージを引数として指定できません。True or False?
Chapter9:000107:A. True
Chapter9:000108:B. False
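
The attached add_index.py is not reproduced here. The following is a minimal sketch of how such a script could produce this format, not the actual program. It assumes that the index counts non-blank lines per file, that it is zero-padded to six digits as in the sample, and that it starts at 1. The names `index_lines` and `build_source_file` are hypothetical.

```python
from pathlib import Path


def index_lines(name, lines, start=1, width=6):
    """Yield 'name:index:text' for each non-blank line.

    start and width are assumptions; the sample only shows six-digit indexes.
    """
    index = start
    for line in lines:
        text = line.rstrip("\r\n")
        if not text.strip():
            continue  # drop blank lines
        yield f"{name}:{index:0{width}d}:{text}"
        index += 1


def build_source_file(src_dir, out_path):
    """Merge every *.txt in src_dir into one indexed file such as JP_orig.txt."""
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8", newline="\n") as out:
        for path in sorted(Path(src_dir).glob("*.txt")):
            if path.resolve() == out_path.resolve():
                continue  # never feed the output file back into itself
            with path.open(encoding="utf-8") as fh:
                for record in index_lines(path.stem, fh):
                    out.write(record + "\n")
```

Note that `sorted()` orders names lexicographically, so Chapter10 would come before Chapter9. Because every record carries its file name, that affects only the order of the merged file, not the keys.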

English files

Processing protocol proposal 1: failed (detex is not accurate enough, so the text is reproduced poorly)

tex to txt conversion (drop tex control sequences) ---> OpenDetex or pandoc
eliminate blank lines (done)
add an index to each line (done)
reflect the index in the tex file (aborted)
compare and match JP_index to EN_index
- replace EN text with JP text
- delete the index
- try compiling the tex files

Processing protocol proposal 2: index the text extracted from the PDF and reflect it in the tex files

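Convert the PDF to text with pdftotext
Create records in the form file name (without extension):index:text
Build a comparison tool and match the indexes
Insert the indexes into the original tex files (or replace the text with the indexes) → the base for automatic conversion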
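
The index matching step could be sketched as follows. This is only an illustration under a strong assumption: that the EN and JP record files list their non-blank lines in the same order, so the same file name and index refer to corresponding text. The names `parse_records` and `match_by_index` are hypothetical.

```python
def parse_records(lines):
    """Map (name, index) -> text for 'name:index:text' records."""
    records = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        name, index, text = line.split(":", 2)  # text may itself contain ':'
        records[(name, index)] = text
    return records


def match_by_index(en_lines, jp_lines):
    """Pair EN and JP text that share a key; also report keys found on one side only."""
    en = parse_records(en_lines)
    jp = parse_records(jp_lines)
    pairs = {key: (en[key], jp[key]) for key in en.keys() & jp.keys()}
    unmatched = sorted(en.keys() ^ jp.keys())
    return pairs, unmatched
```

A non-empty `unmatched` list shows where the two sides disagree, which is where manual review would start.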

Examining plagiarism detection algorithms in Python

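Examine an algorithm for locating the corresponding passages in the tex files
- Simply comparing sentences extracted from the PDF (each naturally made up of several words) against the TeX source files, which contain TeX commands, turned out not to match properly. It did not fail completely: in some cases the final part of a multi-line passage did match.
A similar program that might be usable (though its purpose is different, so it may be a poor fit) --- pysimilar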
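
One way to work around the TeX commands is to strip them crudely from each source line before scoring it against the PDF text. The following is a minimal sketch using only the standard library `difflib`. It is not a plagiarism detector and does not reflect how pysimilar works. The regular expression and the names `strip_tex` and `best_match` are assumptions made for illustration.

```python
import difflib
import re

# Matches control sequences such as \textbf or \section*, plus bare braces.
TEX_COMMAND = re.compile(r"\\[A-Za-z]+\*?|[{}]")


def strip_tex(line):
    """Crudely drop control sequences and braces, keeping their arguments."""
    return " ".join(TEX_COMMAND.sub(" ", line).split())


def best_match(pdf_text, tex_lines):
    """Return (line_number, ratio) of the tex line most similar to pdf_text."""
    target = " ".join(pdf_text.split())
    best = (None, 0.0)
    for number, line in enumerate(tex_lines, start=1):
        ratio = difflib.SequenceMatcher(None, target, strip_tex(line)).ratio()
        if ratio > best[1]:
            best = (number, ratio)
    return best
```

This compares one PDF sentence against one tex line at a time, so sentences wrapped across several source lines would need to be joined first.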

Simple sentence similarity scoring

How to compute sentence similarity in Python: TF-IDF and cosine similarity
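
For reference, TF-IDF weighting and cosine similarity can be sketched with the standard library alone. This is not the code from the linked article. It tokenizes on whitespace, which is only adequate for the English side, and the smoothed IDF formula is one common variant chosen here as an assumption. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per whitespace-tokenized document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed idf keeps terms shared by all documents from vanishing.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Japanese text is not separated by spaces, so it would need a morphological analyzer or character n-grams in place of `split()`.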

pdftotext and detex encode special characters differently (a minor point, but easy to get stuck on)

'What is "Cloud Native" and how it works?' <----- detex (from Tex)
'What is ”Cloud Native” and how it works?\n'  <---- pdftotext (from PDF)
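
A small normalization pass before comparison removes this difference. The following is a minimal sketch covering only typographic quotes and surrounding whitespace. The set of characters in `QUOTE_MAP` is an assumption and may need extending for other symbols that the two tools render differently.

```python
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def normalize(text):
    """Fold typographic quotes to ASCII and trim surrounding whitespace."""
    return text.translate(QUOTE_MAP).strip()


detex_line = 'What is "Cloud Native" and how it works?'
pdf_line = "What is \u201dCloud Native\u201d and how it works?\n"
same_after_normalizing = normalize(detex_line) == normalize(pdf_line)
```

Applying `normalize` to both sides before any similarity scoring keeps such cosmetic differences from lowering the match ratio.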