Problem installing VSM 2.1


Problem installing VSM 2.1

BigJim
I am having a tough time installing VSM 2.1.0 on Ubuntu 14.04. I did a test install in a VM environment and had similar problems. I was able to install VSM 2.0, but not VSM 2.1 or VSM 2.2.

I am now installing on physical hardware: 1 management node and 3 Monitor/OSD nodes.

I am wondering if there may be a problem with the install.sh file in 2.1.
When I run
sudo ./install.sh -u cephuser -v 2.1
(cephuser is my username per the VSM pre-flight), it proceeds to download a lot of packages, then errors here:

Reading package lists... Done
++ dpkg -s dpkg-dev
++ grep 'install ok installed'
++ wc -l
+ IS_DPKG_DEV=1
+ [[ 1 -eq 0 ]]
+ mkdir -p vsm-dep-repo/vsm-dep-repo
+ cd vsm-dep-repo
+ cp '*.deb' vsm-dep-repo
cp: cannot stat ‘*.deb’: No such file or directory


There are no files in vsm-dep-repo
There are files in vsmrepo

It looks like the *.deb packages are not making it into the vsm-dep-repo folder.

I tried
./install.sh -u Ubuntu -v 2.1 --prepare
from the thread http://vsm-discuss.33411.n7.nabble.com/installing-Intel-VSM-td66.html and the error is the same.
I am not sure what to make of
sudo -E preinstall
or its syntax.

I see some other references to the same problem, but no resolution.

Any support is appreciated.

Thanks


Re: Problem installing VSM 2.1

ywang19
Administrator
hi Jim,

vsm-dep-repo is expected to hold the VSM dependency packages, which are hosted at https://github.com/01org/vsm-dependencies. If no local vsm-dep-repo exists, the installer should grab the packages from the vsm-dependencies repo, but the 2.2 branch was never created, which causes no dependency packages to be downloaded. One workaround is to download all of the deb packages yourself and put them into a local vsm-dep-repo before installing.
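A rough sketch of that workaround, run from the directory that contains install.sh (the branch name and the ubuntu14 subdirectory are assumptions; check the vsm-dependencies repo for its actual layout):

# Stage the dependency debs where install.sh looks for them (paths assumed):
git clone -b 2.1 https://github.com/01org/vsm-dependencies.git
mkdir -p vsm-dep-repo
cp vsm-dependencies/ubuntu14/*.deb vsm-dep-repo/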

-yaguang


Re: Problem installing VSM 2.1

BigJim
I have now reinstalled the OS on the management node, and I am attempting the 2.1 install again. I do get a bit further. My 4 nodes are:
Management vsma 10.0.40.10
osd1/monitor t300a 10.0.40.11
osd2/monitor t300b 10.0.40.12
osd3/monitor t300c 10.0.40.13

The install seems to go smoothly on 10.0.40.10. When it gets to the first OSD node, 10.0.40.11, I am prompted for the password about 18 times. After each password entry it does a bit of the install, then prompts for the password again. It finally stops with this error:

No handlers could be found for logger "vsm.manifest.parser"
Traceback (most recent call last):
  File "/usr/local/bin/server_manifest", line 325, in <module>
    smp = ManifestChecker(fpath)
  File "/usr/local/bin/server_manifest", line 56, in __init__
    self._info = self._smp.format_to_json(check_manifest_tag=True)
  File "/usr/local/lib/python2.7/dist-packages/vsm/manifest/parser.py", line 582, in format_to_json
    return self._format_server_manifest_to_json(check_manifest_tag)
  File "/usr/local/lib/python2.7/dist-packages/vsm/manifest/parser.py", line 414, in _format_server_manifest_to_json
    self._dict_insert_auth_key()
  File "/usr/local/lib/python2.7/dist-packages/vsm/manifest/parser.py", line 371, in _dict_insert_auth_key
    raise
TypeError: exceptions must be old-style classes or derived from BaseException, not NoneType
Connection to 10.0.40.11 closed.

I have gone through the ssh config and I can ssh into all nodes with no password. I have rebuilt the manifests several times, and re-downloaded and installed the 2.1 Ubuntu package. It always ends at the same spot, vsm.manifest.parser.

The cluster.manifest:

[cluster]
cluster_a

[file_system]
xfs

[management_addr]
10.0.40.0/24

[ceph_public_addr]
10.0.40.0/24

[ceph_cluster_addr]
10.0.50.0/24

This is a two-NIC setup. The management node is not on the ceph_cluster_addr 10.0.50.0/24 subnet, only the management_addr subnet.


In the server manifest, I have noticed that token-tenant disappears from the [auth_key] section after the attempted install on the 10.0.40.11 node. It is still there in the manifests of the other two OSD nodes.



If I change the installrc to use hostnames instead of IPs, I get an error right away.

dpkg-scanpackages: info: Wrote 1383 entries to output Packages file.
+ cd /home/mnadmin/Downloads/2.1.0-336
+ rm -rf vsm.list vsm-dep.list
+ cat
+ cat
+ install_controller
+ check_manifest vsma
+ [[ vsma == vsma ]]
+ [[ ! -d manifest/vsma ]]
+ echo 'Please check the manifest, then try again.'
Please check the manifest, then try again.
+ exit 1


I am not sure which manifest is being referenced here.

Re: Problem installing VSM 2.1

BigJim
Update:

I had two NICs:

em1 192.168.x.x for Internet
em2 10.0.40.10 for cluster management

I tried switching them:

em1 10.0.40.10 for cluster management
em2 192.168.x.x for Internet

No joy.

I finally deleted em2 and set up a VLAN to get Internet access on 10.0.40.10.

Success! The install completed, and I created a 3-node cluster with monitors and OSDs on all 3 nodes.

Now two new problems, if anyone has some insight.

Clock skew detected on mon.1, mon.2

I am not sure of the best way to resolve this. It looks like the clock skew is mon.1 0.520963s > max 0.2s, mon.2 0.482344s > max 0.2s.

and

192 pgs stuck inactive
192 pgs stuck unclean

If I run ceph -s on the first OSD node, the last line is "192 creating", so I am giving it time to build the cluster. I am not sure how long that should take. I have 12 160 GB OSD drives, and it shows 12 osds: 0 up, 0 in.

Re: Problem installing VSM 2.1

bxzhu
hi BigJim,
You can set the controller node as the NTP server and the other nodes as NTP clients to keep the system time synchronized; that will solve the first issue.
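A minimal ntpd sketch, assuming the controller at 10.0.40.10 serves time and the other nodes sync from it (the local-clock fallback is a common convention, not a VSM requirement):

# /etc/ntp.conf on the controller (10.0.40.10):
server 127.127.1.0               # local clock as a fallback time source
fudge  127.127.1.0 stratum 10

# /etc/ntp.conf on each monitor/OSD node:
server 10.0.40.10 iburst         # sync from the controller

# then on every node:
sudo service ntp restart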
But judging from the second issue (12 osds: 0 up, 0 in), your Ceph cluster was not set up successfully. Please try the following steps:
1. ceph --version    --> please provide the version of your Ceph.
2. Try to start the OSDs manually on t300a, t300b and t300c.
3. If possible, run server_manifest on the t300a node and provide the process and result of the command.

Re: Problem installing VSM 2.1

BigJim
Thank you.

I was able to resolve the NTP problem.

The Ceph version was 0.80.11. I was able to upgrade to 0.94.9 using the VSM dashboard.

I have stopped and started the OSDs manually on t300a, t300b and t300c. I have also started them from the dashboard.

There are 4 OSDs per node, so 12 OSDs total. They are all out-down. I created them as "7200_rpm_sata capacity", not as "10krpm_sas performance". The 3 default pools are set for "10krpm_sas performance", so I created a pool in the "7200_rpm_sata capacity" storage group, hoping that was my problem. It was not.
The OSDs all show up as green under Manage Devices. The monitors are Health_OK, and the OSD summary lists the VSM status as present.

Here is the server_manifest.

I am not sure how to execute the agent-token command to check the auth_keys.

Also, when I reboot the server, I get this message for the OSDs.


Thanks in advance.

-BigJim

Re: Problem installing VSM 2.1

bxzhu
You can run "agent-token" on the controller (management) node to get the auth_key.
Then you can run "replace-str <auth_key>" on each OSD node to replace the auth key stored in the file /etc/manifest/server.manifest.
Finally, run the command "server_manifest" on the OSD node to check the connection between the controller node and the OSD node. If the result is "Check Success ~~", congratulations! Then you can do the following to re-set up the Ceph cluster:
1. Run "clean-data -f" on all the nodes (controller node and OSD nodes).
2. Run "ps -ef | grep vsm; ps -ef | grep ceph" to check that all the VSM and Ceph related services have stopped.
3. Run "agent-token" on the controller node to get the auth_key.
4. Run "replace-str <auth_key>; service vsm-agent restart; service vsm-physical restart" on each OSD node.
5. Run "service vsm-agent status; service vsm-physical status" to make sure the vsm-agent and vsm-physical services have started successfully.
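Condensed into one runnable sequence (a sketch of the steps above; <auth_key> stays a placeholder for the value printed by agent-token):

# 1. On all nodes (controller and OSD):
clean-data -f
ps -ef | grep vsm; ps -ef | grep ceph     # confirm all vsm/ceph services are stopped

# 2. On the controller node:
agent-token                               # prints the auth_key

# 3. On each OSD node, pasting the key from step 2:
replace-str <auth_key>
service vsm-agent restart; service vsm-physical restart
service vsm-agent status; service vsm-physical status    # both should be running

# 4. On each OSD node, verify the connection back to the controller:
server_manifest                           # should end with "Check Success ~~"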

Re: Problem installing VSM 2.1

BigJim
Looks like I might have bigger problems. Any ideas?

Re: Problem installing VSM 2.1

BigJim
Solved it.
I ran the commands successfully, each on its own line.

The ones that did not work were service vsm-agent status and service vsm-physical status.
They both show as off.
I double-checked with service --status-all, and they are all off.

The servers do show up in the dashboard as available under Create Cluster, so I created the cluster.
server_manifest reports "Check Success ~~".

After 15 minutes:

Cluster Health_warn
64 pgs stuck inactive
64 pgs stuck unclean
Monitors Health_OK
osdmap e14: 12 osds: 0 up, 0 in
pgmap v15: 64 pgs, 1 pools, 0 bytes data, 0 objects, 0 kB used, 0 kB / 0 kB avail
64 creating

Re: Problem installing VSM 2.1

BigJim
There must be something wrong with the agents.
The new cluster never came up.

I ran all the steps again, and while I get "Check Success ~~", eventually the token expires and the servers no longer communicate with the management node.

I am going to try once more today, but will start anew tomorrow.

Re: Problem installing VSM 2.1

bxzhu
In reply to this post by BigJim
You can use "sudo service vsm-agent restart" and "sudo service vsm-physical restart" to restart the services when you are logged in as a non-root user.
BTW, if possible, after you have created the Ceph cluster again, can you provide the logs from the controller node (/var/log/vsm/*.log) and all OSD nodes (/var/log/vsm/*.log)? They will help me analyze your problem.
Finally, make sure that vsm-api, vsm-conductor and vsm-scheduler are running on the controller node, and that vsm-agent and vsm-physical are running on the OSD nodes.
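A quick way to check all five services in one pass (a sketch; the service names are the ones listed above):

# On the controller node:
for s in vsm-api vsm-conductor vsm-scheduler; do sudo service "$s" status; done

# On each OSD node:
for s in vsm-agent vsm-physical; do sudo service "$s" status; done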

Re: Problem installing VSM 2.1

BigJim
OK, here is what I have done.
I deleted the entire install, stopped using the server vsma, and added t300d (I have 4 Dell T300s for this project).
I reformatted all drives, reinstalled Ubuntu on all nodes, and set up as follows:
t300a VSM                           10.0.40.11
t300b OSD and Monitor         10.0.40.12 10.0.50.12
t300c OSD and Monitor         10.0.40.13 10.0.50.13
t300d OSD and Monitor         10.0.40.14 10.0.50.14
(attachment: t300x_vsm.gz)
I did the preflight per INSTALL.md version 2.1.0-336, then ran
sudo ./install.sh -u mnadmin -v 2.1
The install was OK on t300a, but on t300b, t300c, and t300d I get prompted for the mnadmin password about 20 times per node. Is this normal?

The install completed successfully, and I was able to log in to the dashboard and see the 3 servers waiting to join the cluster. I created a new user, then let it sit overnight. This morning I created the cluster (using the new user).

I have the same problem. All 3 monitors are Health_OK. All 12 OSDs are out-down. The dashboard says 192 pgs stuck inactive, 192 pgs stuck unclean.
I did stop and start an OSD, but it did not change the dashboard.

Now it looks like there is an extra OSD when I run ceph -s on t300b:
mnadmin@t300b:~$ ceph -s
    cluster b3ff53e0-a297-11e6-bc86-001e4f3770c1
     health HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean
     monmap e1: 3 mons at {0=10.0.40.12:6789/0,1=10.0.40.13:6789/0,2=10.0.40.14:6789/0}, election epoch 4, quorum 0,1,2 0,1,2
     osdmap e17: 13 osds: 0 up, 0 in
      pgmap v18: 192 pgs, 3 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                 192 creating


I created the same cluster.manifest as above. The server.manifest was changed to the correct IPs, and I moved the OSDs from 7200_rpm_sata to 10krpm_sas, just because that looks like the default.

The OSDs and journals are the same on all 3 servers:

[osd]      [journal]
/dev/sdb1  /dev/sda6
/dev/sdc1  /dev/sda7
/dev/sdd1  /dev/sda8
/dev/sde1  /dev/sda9

I attached the files you requested.

Thank you again for your assistance.


Re: Problem installing VSM 2.1

BigJim
Forgot to mention: I am running Ceph 0.80.11 on the new install. I can upgrade, but thought I would hold off for now.

Re: Problem installing VSM 2.1

bxzhu
In reply to this post by BigJim
hi BigJim,
    Looking at the vsm-agent.log from t300b, there are some issues when creating the OSDs, as follows:
2016-11-04 09:05:16     INFO [vsm.utils] Running cmd = sudo vsm-rootwrap /etc/vsm/rootwrap.conf mount -t xfs -o rw,noatime,inode64,logbsize=256k,delaylog /dev/sdd1 /var/lib/ceph/osd/osd2
2016-11-04 09:05:16     INFO [vsm.utils] stdout =
2016-11-04 09:05:16     INFO [vsm.utils] stderr = mount: wrong fs type, bad option, bad superblock on /dev/sdd1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

    So please make sure that the /dev/sdd1 partition exists and that you can use the xfs filesystem.
    Test it like this: if you have a /dev/sdd1 partition and a test directory /tmp/test, run the command "mount -t xfs -o rw,noatime,inode64,logbsize=256k,delaylog /dev/sdd1 /tmp/test".
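As a runnable version of that test (assuming /dev/sdd1 exists and is not already mounted):

sudo mkdir -p /tmp/test
sudo mount -t xfs -o rw,noatime,inode64,logbsize=256k,delaylog /dev/sdd1 /tmp/test
dmesg | tail      # if the mount fails, the kernel log usually names the offending option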

Re: Problem installing VSM 2.1

BigJim
I pretty much destroyed my last install over the weekend, so I rebuilt the cluster again.
I have formatted the disks as xfs using parted, the Ubuntu installer disk, and, this last time, ceph-disk zap followed by ceph-disk prepare --fs-type xfs /dev/sdd.
I think the problem may be a bad mount option.
I do not know a lot about setting mount points. Everything I read says "format" the disk, then "sudo mkdir /tmp/test". I am confused as to how that puts a directory on sdd1; there must be some value that says put /tmp/test on sdd1. Why wouldn't it just make a directory on the current drive?

Anyway... I digress.
I believe /var/lib/ceph/osd/osd2 is the correct mount point for /dev/sdd1.

If I look at /etc/fstab, this is the content:

/dev/sdd1 /var/lib/ceph/osd/osd2 xfs rw,noatime,inode64,logbsize=256k,delaylog 0 0 ## forvsmosd
/dev/sdc1 /var/lib/ceph/osd/osd1 xfs rw,noatime,inode64,logbsize=256k,delaylog 0 0 ## forvsmosd
/dev/sdb1 /var/lib/ceph/osd/osd0 xfs rw,noatime,inode64,logbsize=256k,delaylog 0 0 ## forvsmosd
/dev/sde1 /var/lib/ceph/osd/osd3 xfs rw,noatime,inode64,logbsize=256k,delaylog 0 0 ## forvsmosd

Based on your suggestion, I ran
"sudo mount -t xfs -o rw,noatime,inode64,logbsize=256k,delaylog /dev/sdd1 /var/lib/ceph/osd/osd2"
and I get the error:
mount: wrong fs type, bad option, bad superblock on /dev/sdd1
missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so


If I run "sudo mount -a", I get the same error:

mount: wrong fs type, bad option, bad superblock on /dev/sdd1
missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

for all 4 disks: sdb1, sdc1, sdd1, and sde1.

But if I run "sudo mount -t xfs /dev/sdd1 /var/lib/ceph/osd/osd2", the drive shows as mounted when I type "mount".

So I think it may be one of the listed options:
rw,noatime,inode64,logbsize=256k,delaylog

I found a similar error from someone using Proxmox; they changed the OSD mount options
FROM:
osd mount options xfs = rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M

TO:
osd mount options xfs = rw,noatime,inode64,logbsize=256k,allocsize=4M

and had success.
see: https://forum.proxmox.com/threads/proxmox-4-2-ceph-hammer-create-osd-failed.28047/
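Applying the same idea to the fstab above would mean dropping delaylog from each entry, e.g. (a sketch, untested here):

/dev/sdb1 /var/lib/ceph/osd/osd0 xfs rw,noatime,inode64,logbsize=256k 0 0 ## forvsmosd
/dev/sdc1 /var/lib/ceph/osd/osd1 xfs rw,noatime,inode64,logbsize=256k 0 0 ## forvsmosd
/dev/sdd1 /var/lib/ceph/osd/osd2 xfs rw,noatime,inode64,logbsize=256k 0 0 ## forvsmosd
/dev/sde1 /var/lib/ceph/osd/osd3 xfs rw,noatime,inode64,logbsize=256k 0 0 ## forvsmosd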


I think it will work once I can get the drives to mount correctly.
I did upgrade this install to Ceph 0.94.9.

Ideas?



Re: Problem installing VSM 2.1

bxzhu
Hi BigJim,

Have you tried setting "osd mount options xfs" to "rw,noatime,inode64,logbsize=256k,allocsize=4M"? Does it work?

Re: Problem installing VSM 2.1

BigJim
When I set the mount options to "rw,noatime,inode64,logbsize=256k,allocsize=4M", there is a mount error listed in the boot sequence and the drives are mounted, BUT the cluster still does not work. When I check the fstab file, it has reset to the previous string. I did some more digging, and it looks like delaylog has been deprecated in Ubuntu 16 (newer kernels made the delaylog behavior the default and dropped the option, so mount rejects it as unknown). I thought maybe the updates I was running were at fault, so I redeployed without running sudo apt-get update and upgrade. Still the same error.
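For the fstab route, one way to strip the option and remount (a sketch; note the caveat above that VSM rewrites /etc/fstab, so the edit may not survive a redeploy):

sudo sed -i 's/,delaylog//g' /etc/fstab   # drop the deprecated option from every entry
sudo mount -a                             # remount; the options should no longer be rejected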

I decided to take a step backwards and see what I can find out.

I created a Ceph cluster using virtual machines and can replicate a similar problem.
If I format the drives in advance, the cluster will not work.
I have 5 drives, sdb, sdc, sdd, sde, and sdf, where sdf is my journal drive. I partition as follows:

sudo parted  /dev/sd(x) mklabel gpt                               -- for all 5 drives.

sudo parted /dev/sd(x) mkpart primary xfs 0% 100%       -- for sdb to sde.

sudo parted /dev/sdf mkpart primary 0% 25%
sudo parted /dev/sdf mkpart primary 25% 50%
sudo parted /dev/sdf mkpart primary 50% 75%
sudo parted /dev/sdf mkpart primary 75% 100%             --for the journal on sdf.

When I have installed Ceph and am adding the OSDs with
ceph-deploy osd create {node-name}:{disk}[:{path/to/journal}]
e.g. ceph-deploy osd create node1:/dev/sdb1:/dev/sdf1
the command fails.

When I run ceph-deploy disk zap node1:sdb
and then ceph-deploy osd create node1:sdb:/dev/sdf1,
the command completes successfully and sdb now has a partition listed, but no OSD is added to the ceph osd tree.

When I run ceph-deploy disk zap node1:sdb node1:sdf
and then ceph-deploy osd create node1:sdb:sdf,
the command runs successfully, and an OSD is created on /dev/sdb1 with a journal on /dev/sdf1.

I know this is not exactly the same error, but I think it is close. Also, this is ceph -v 10.2.4.

Questions:

1. Do the partition commands I am running look correct? In the VSM install.pdf they use a full disk for the journal, not a disk with many partitions. Also, the command listed in the install.pdf is missing the xfs parameter (I have tried both):
sudo parted -a optimal /dev/sdb -- mkpart primary 1MB 100%
versus
sudo parted /dev/sd(x) mkpart primary xfs 0% 100%

2. I need to enter the password about 20 times for each node during install; that is 60 times for 3 nodes. Is this normal? I want to be sure I do not have a permissions problem as well (see the ssh check below).
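A quick check that passwordless ssh is actually in place for the install user (a sketch; the hostnames are the ones from this setup):

# Run as mnadmin on the management node:
for h in t300b t300c t300d; do ssh mnadmin@"$h" true && echo "$h ok"; done
# Any password prompt here means the pre-flight key setup is incomplete.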


Re: Problem installing VSM 2.1

bxzhu
For ceph-deploy, if you make the journal partition yourself, you should set ceph:ceph ownership on the partition.
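A sketch of that permission fix, assuming a hand-made journal partition at /dev/sdf1 (relevant for Ceph releases that run the daemons as the ceph user, such as 10.2.x):

sudo chown ceph:ceph /dev/sdf1   # repeat for each journal partition you created yourself

Note the ownership of a device node does not persist across reboots unless a udev rule reapplies it.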

Q&A:
1. Have you tried installing other Ceph versions, like hammer?
2. How do you run the install.sh script, and which user can log in to all the nodes without a password?

Re: Problem installing VSM 2.1

BigJim
I was running
sudo ./install.sh -u mnadmin -v 2.1
Duh. I changed it to
./install.sh -u mnadmin -v 2.1
and now I only get one password prompt per node during the install (presumably because under sudo the ssh connections ran as root, which did not have mnadmin's passwordless keys).

I am still working through the mount problems.

mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so

Running dmesg | tail:
[   28.205496] XFS (sdb1): unknown mount option [delaylog].
[   28.311526] XFS (sdd1): unknown mount option [delaylog].
[   84.113797] XFS (sdd1): unknown mount option [delaylog].
[   84.119065] XFS (sdb1): unknown mount option [delaylog].


[osd]        [journal]
/dev/sdb1 /dev/sdc1
/dev/sdd1 /dev/sde1

These are running VMs, and this install is Hammer.

I have tried VSM 2.0 and 2.1 with both Firefly and Hammer.