How to Install Bayon on a Mac

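bayon is a general-purpose data clustering tool. Its design is simple, and it runs fast even on large data sets. It is handy when you want an overview of a large amount of data, or when you want to quickly split the data into groups of similar items and inspect them, so it is worth having installed.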

Environment

Installation steps

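As of 2015, the latest version of bayon is 0.1.1. The session below extracts the downloaded bayon-0.1.1.tar.gz archive, then runs ./configure, make, and sudo make install, and finally runs bayon with no arguments to print its usage text.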

% tar xvzf bayon-0.1.1.tar.gz

x bayon-0.1.1/
x bayon-0.1.1/byvector.h
x bayon-0.1.1/TODO
x bayon-0.1.1/analyzer.h
x bayon-0.1.1/document.h
x bayon-0.1.1/COPYING
x bayon-0.1.1/utiltest.cc
x bayon-0.1.1/bayon.h
x bayon-0.1.1/anatest.cc
x bayon-0.1.1/Makefile.in
x bayon-0.1.1/byvector.cc
x bayon-0.1.1/configure.in
x bayon-0.1.1/document.cc
x bayon-0.1.1/config.h.in
x bayon-0.1.1/util.h
x bayon-0.1.1/clutest.cc
x bayon-0.1.1/clatest.cc
x bayon-0.1.1/cluster.h
x bayon-0.1.1/classifier.h
x bayon-0.1.1/bayon.cc
x bayon-0.1.1/configure
x bayon-0.1.1/lda.cc
x bayon-0.1.1/analyzer.cc
x bayon-0.1.1/README
x bayon-0.1.1/classifier.cc
x bayon-0.1.1/data/
x bayon-0.1.1/data/test3.tsv
x bayon-0.1.1/data/test1.tsv
x bayon-0.1.1/data/test2.tsv
x bayon-0.1.1/data/test4.tsv
x bayon-0.1.1/VCmakefile
x bayon-0.1.1/doctest.cc
x bayon-0.1.1/Doxyfile
x bayon-0.1.1/plsi.cc
x bayon-0.1.1/util.cc
x bayon-0.1.1/cluster.cc
x bayon-0.1.1/vectest.cc
% cd bayon-0.1.1
/bayon-0.1.1% ./configure

checking for gcc... gcc-4.2
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc-4.2 accepts -g... yes
checking for gcc-4.2 option to accept ISO C89... none needed
checking for g++... g++
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking how to run the C++ preprocessor... g++ -E
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking google/dense_hash_map usability... no
checking google/dense_hash_map presence... no
checking for google/dense_hash_map... no
checking ext/hash_map usability... yes
checking ext/hash_map presence... yes
checking for ext/hash_map... yes
checking gtest/gtest.h usability... no
checking gtest/gtest.h presence... no
checking for gtest/gtest.h... no
configure: WARNING: The test tools of bayon require gtest. If you use test tools, you must install it.
configure: creating ./config.status
config.status: creating Makefile
config.status: creating config.h
/bayon-0.1.1% make
g++ -c -I. -I/usr/local/include/bayon -I/Users/zero/include -I/usr/local/include -D_GNU_SOURCE=1 -DHAVE_CONFIG_H -Wall -fPIC -O3 analyzer.cc
g++ -c -I. -I/usr/local/include/bayon -I/Users/zero/include -I/usr/local/include -D_GNU_SOURCE=1 -DHAVE_CONFIG_H -Wall -fPIC -O3 byvector.cc
g++ -c -I. -I/usr/local/include/bayon -I/Users/zero/include -I/usr/local/include -D_GNU_SOURCE=1 -DHAVE_CONFIG_H -Wall -fPIC -O3 classifier.cc
g++ -c -I. -I/usr/local/include/bayon -I/Users/zero/include -I/usr/local/include -D_GNU_SOURCE=1 -DHAVE_CONFIG_H -Wall -fPIC -O3 cluster.cc
g++ -c -I. -I/usr/local/include/bayon -I/Users/zero/include -I/usr/local/include -D_GNU_SOURCE=1 -DHAVE_CONFIG_H -Wall -fPIC -O3 document.cc
g++ -c -I. -I/usr/local/include/bayon -I/Users/zero/include -I/usr/local/include -D_GNU_SOURCE=1 -DHAVE_CONFIG_H -Wall -fPIC -O3 util.cc
ar rv libbayon.a analyzer.o byvector.o classifier.o cluster.o document.o util.o
ar: creating archive libbayon.a
a - analyzer.o
a - byvector.o
a - classifier.o
a - cluster.o
a - document.o
a - util.o
g++ -Wall -fPIC -O3 -dynamiclib -o libbayon.1.1.0.dylib \
	  -install_name /usr/local/lib/libbayon.1.dylib \
	  -current_version 1.1.0 -compatibility_version 1 \
	  analyzer.o byvector.o classifier.o cluster.o document.o util.o -L. -L/usr/local/lib -L/Users/zero/lib -L/usr/local/lib
ld: warning: directory not found for option '-L/Users/zero/lib'
ln -f -s libbayon.1.1.0.dylib libbayon.1.dylib
ln -f -s libbayon.1.1.0.dylib libbayon.dylib
g++ -c -I. -I/usr/local/include/bayon -I/Users/zero/include -I/usr/local/include -D_GNU_SOURCE=1 -DHAVE_CONFIG_H -Wall -fPIC -O3 bayon.cc
LD_RUN_PATH=.:/lib:/usr/lib:/usr/local/lib:/Users/zero/lib:/usr/local/lib:/usr/local/lib g++ -Wall -fPIC -O3 -o bayon bayon.o -L. -L/usr/local/lib -L/Users/zero/lib -L/usr/local/lib  -lbayon
ld: warning: directory not found for option '-L/Users/zero/lib'
/bayon-0.1.1% sudo make install

Password:
mkdir -p /usr/local/include/bayon
cp -Rf bayon.h analyzer.h byvector.h classifier.h cluster.h document.h util.h config.h /usr/local/include/bayon
mkdir -p /usr/local/lib
cp -Rf libbayon.a libbayon.1.1.0.dylib libbayon.1.dylib libbayon.dylib /usr/local/lib
mkdir -p /usr/local/bin
cp -Rf bayon /usr/local/bin
/bayon-0.1.1% bayon

bayon: simple and fast clustering tool

Usage:
* Clustering input data
 % bayon -n num [options] file
 % bayon -l limit [options] file
    -n, --number=num      the number of clusters
    -l, --limit=lim       limit value of cluster bisection
    -p, --point           output similarity points
    -c, --clvector=file   save vectors of cluster centroids
    --clvector-size=num   max size of output vectors of
                          cluster centroids (default: 50)
    --method=method       clustering method(rb, kmeans), default:rb
    --seed=seed           set a seed for random number generator

* Get the similar clusters for each input documents
 % bayon -C file [options] file
    -C, --classify=file   target vectors
    --inv-keys=num        max size of the keys of each vector to be
                          looked up in inverted index (default: 20)
    --inv-size=num        max size of the inverted index of each key
                          (default: 100)
    --classify-size=num   max size of output similar groups
                          (default: 20)

* Common options
    --vector-size=num     max size of each input vector
    --idf                 apply idf to input vectors
    -h, --help            show help messages
    -v, --version         show the version and exit

bayon is now ready to use.
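To confirm the installation from a fresh shell, a minimal check along these lines may help. It relies only on the --version option listed in the usage text above; the variable name bayon_status is made up for this sketch, and it assumes /usr/local/bin is on your PATH.

```shell
#!/bin/sh
# Check whether the bayon binary installed by "make install" is reachable on PATH.
# bayon_status is a hypothetical variable name used only in this sketch.
if command -v bayon >/dev/null 2>&1; then
  bayon_status=installed
  echo "bayon found at: $(command -v bayon)"
  # --version appears in the usage text above ("show the version and exit").
  bayon --version
else
  bayon_status=missing
  echo "bayon was not found on PATH; check that /usr/local/bin is included."
fi
```

If the binary is reported missing even though make install succeeded, the likely cause is that /usr/local/bin is not on the PATH of the current shell.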

How to use bayon

The input to bayon is a tab-delimited text file. Each line holds a document ID followed by key/value pairs:
document_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
document_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
...
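As a minimal sketch of this format, the following writes two documents to a file with printf, so that every separator is a real tab character. The file name input.tsv, the document IDs, and the keys are made up for illustration.

```shell
#!/bin/sh
# Hypothetical file name, document IDs, and keys, chosen only for illustration.
# Each line is: document_id <TAB> key <TAB> value [<TAB> key <TAB> value ...]
printf 'doc1\tapple\t3\tbanana\t1\n' >  input.tsv
printf 'doc2\tbanana\t2\tcherry\t5\n' >> input.tsv

# Count the tab-separated fields on each line to make the structure visible.
awk -F'\t' '{ print $1 ": " NF " fields" }' input.tsv
```

Each line here has one ID plus two key/value pairs, so awk reports 5 fields for both documents. Writing the tabs as \t avoids editors that silently convert tabs to spaces.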
Sample input data (one person per line, with music genres as keys and a score for each genre as values):
阿佐田	J-POP	10	J-R&B	6	ロック	4
小島	ジャズ	8	レゲエ	9		
古川	クラシック	4	ワールド	4		
田村	ジャズ	9	メタル	2	レゲエ	6
青柳	J-POP	4	ロック	3	HIPHOP	3
三輪	クラシック	8	ロック	1		
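The sample above can be tried out roughly as follows. This is a sketch under a few assumptions: the file name sample.tsv is arbitrary, the awk check is my own sanity check and not part of bayon, and the cluster count of 3 is an arbitrary choice. The only bayon feature used is the -n option shown in the usage text. The rows are written without trailing tabs.

```shell
#!/bin/sh
# Write the sample data to a file. The name sample.tsv is an arbitrary choice.
{
  printf '阿佐田\tJ-POP\t10\tJ-R&B\t6\tロック\t4\n'
  printf '小島\tジャズ\t8\tレゲエ\t9\n'
  printf '古川\tクラシック\t4\tワールド\t4\n'
  printf '田村\tジャズ\t9\tメタル\t2\tレゲエ\t6\n'
  printf '青柳\tJ-POP\t4\tロック\t3\tHIPHOP\t3\n'
  printf '三輪\tクラシック\t8\tロック\t1\n'
} > sample.tsv

# Sanity check (not part of bayon): each line needs an ID plus complete
# key/value pairs, i.e. an odd number of fields (at least 3), and every
# value field must look like a number.
if awk -F'\t' '
  NF < 3 || NF % 2 == 0 {
    printf "line %d: bad field count %d\n", NR, NF; bad = 1; next
  }
  {
    for (i = 3; i <= NF; i += 2)
      if ($i !~ /^[0-9]+(\.[0-9]+)?$/) {
        printf "line %d: non-numeric value %s\n", NR, $i; bad = 1
      }
  }
  END { exit bad + 0 }
' sample.tsv; then
  format_ok=yes
else
  format_ok=no
fi
echo "format_ok=$format_ok"

# Cluster the six people into 3 groups. Skipped when bayon is not on PATH.
if [ "$format_ok" = yes ] && command -v bayon >/dev/null 2>&1; then
  bayon -n 3 sample.tsv
fi
```

According to the usage text, -l can be used instead of -n to give a limit value for cluster bisection rather than a fixed number of clusters, and -p adds similarity points to the output.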
