Rigorous and Reliable (RAR) - Service management of Sage


(09-07-2016 05:35 PM)lingu Wrote: This is too heavyweight -- it may even disrupt Sage if it is busy.

Sending a file takes 2-3s. This is considered long/heavyweight.

I once tested sending a small file from s188 to s188. It took only 0m0.009s. I think that is good enough.

Quote:
Quote:Then the glad operative user & sage_user can use sage_key to execute remote commands on sage_user@sage_portal.

2. The glad_portal IP can be obtained from $gb/conf/config.sh

How (a convar like glad_portal)? Have we made sure that such a variable must exist in $gb/conf/config.sh? If not, please add a note there, "need implement", so that we know to create a TODO item and implement this after the design is endorsed.

We already have a $glad_portal that can be used directly. It is set by util/install.sh during installation.

Quote:What if the wait_sage_start fails?

We need a way to check that Sage has started. I thought this is what you were doing. But it looks like if auntie keeps failing, the wait still completes without an error halt.

This is a bug. I will fix it.

(09-07-2016 05:53 PM)YU_Xinjie Wrote:
(09-07-2016 05:35 PM)lingu Wrote: This is too heavyweight -- it may even disrupt Sage if it is busy.

Sending a file takes 2-3s. This is considered long/heavyweight.

I once tested sending a small file from s188 to s188. It took only 0m0.009s. I think that is good enough.

The auntie command may complete in 0.009s, but the transfer should take longer because I added a 1s wait somewhere. Maybe it is not in the critical path...

Anyway, it is okay to keep it this way for a while.

Quote:We already have a $glad_portal that can be used directly. It is set by util/install.sh during installation.

OK. Then it's fine.

Quote:
Quote:What if the wait_sage_start fails?

We need a way to check that Sage has started. I thought this is what you were doing. But it looks like if auntie keeps failing, the wait still completes without an error halt.

This is a bug. I will fix it.

Please revise the design of this part first, as a reply; then I can reply to endorse the overall design.

The real "fix" should be after the design is endorsed.

(09-07-2016 06:18 PM)lingu Wrote: Please revise the design of this part first, as a reply; then I can reply to endorse the overall design.

The real "fix" should be after the design is endorsed.

Updated the code of "start sage" in the headpost.

Please take a look again.

(07-12-2016 05:54 PM)YU_Xinjie Wrote:
goal

1. The script can be used on both sage_portal & glad_portal to start, stop, and check the status of Sage.
2. Users can start Sage before glad, and stop Sage after glad.

The script would replace the current start.sh/stop.sh/status.sh in the future.
The script may also give us ideas about how to set up regression tests for Sage.

For shbio's project, I have already written a script, /thinker/dstore/run/sage, that supports part of the goal. I will try to generalize it.
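
A minimal sketch of how a single entry point could cover all three actions, assuming the start_sage, stop_sage, and status_sage functions from the sections below are defined in the same script. The wrapper name sage_service and the usage text are hypothetical, not taken from the existing script.

```shell
#!/usr/bin/env bash
# Hypothetical dispatcher. start_sage, stop_sage and status_sage are assumed
# to be defined elsewhere in the same script.
sage_service () {
  case "${1:-}" in
    start)  start_sage ;;
    stop)   stop_sage ;;
    status) status_sage ;;
    *)
      # Unknown or missing action: print usage and return a distinct code.
      echo "Usage: sage {start|stop|status}" >&2
      return 2
      ;;
  esac
}
```

The script body would then end with a call such as sage_service "$@", so the same file behaves identically on sage_portal and glad_portal.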

config design

In brief, the idea is to use auntie to send a file to glad_portal and then check the file content.

1.
Require that sage_user can log in to sage_user@sage_portal without a password.
Copy the private key of sage_user@sage_portal into $gb/conf/sage_key.

Then the glad operative user & sage_user can use sage_key to execute remote commands on sage_user@sage_portal.

2. The glad_portal IP can be obtained from $gb/conf/config.sh

3.
Create a config file $gb/conf/sage_portal_ip to record the IP of sage_portal. Alternatively, we can write it into $gb/conf/config.sh.

4. Let sage_user@sage_portal log in to glad_user@glad_portal without a password.
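
The code in the following sections calls remote_cmd, whose body is not shown in this post. A minimal sketch of what such a helper might look like, assuming the key and IP files from items 1 and 3 above and that $gb and $sage_user are already set by the surrounding script; the function body is an assumption, not the committed implementation.

```shell
#!/usr/bin/env bash
# Sketch only: the real remote_cmd is not shown in the thread.
# Assumes:
#   $gb        - glad base directory (as used in the thread)
#   $sage_user - account name on sage_portal (as used in the thread)
remote_cmd () {
  local key="$gb/conf/sage_key"
  local portal
  # Read the sage_portal IP recorded in item 3.
  portal="$(cat "$gb/conf/sage_portal_ip")" || return 1
  # BatchMode=yes makes ssh fail instead of prompting for a password.
  ssh -i "$key" -o BatchMode=yes "${sage_user}@${portal}" "$@"
}
```

With this shape, remote_cmd "cd ~/sage; ./status.sh" runs the status script on sage_portal and returns the exit status of ssh.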

start sage

Code:

wait_sage_start () {
  content="This is test content."
  echo $content > $curdir/${USER}_sage_testfile
  test_file=/thinker/bin/ephemeral/667.sagetest

  # 30 seconds timeout
  try_cnt=10
  while [[ ("$try_cnt" -gt "0") && ((! -f "$test_file") || ("$(cat $test_file)" != "$content")) ]]; do
    timeout -s SIGINT 1 bash -c "sage_user=$sage_user $curdir/auntie ^$curdir/${USER}_sage_testfile $glad_portal 667 sagetest" || echo -n ''
    echo "Trying to start Sage..."
    sleep 2
    try_cnt=$((try_cnt-1))
  done
  rm -f $test_file
  if [[ "$try_cnt" == "0" ]];then
    # timeout
    return 1
  else
    return 0
  fi
}

start_sage () {
  # check whether sage is really stopped.
  local tmp="$(status_sage)"
  if [[ "$tmp" != "Sage is stopped." ]]; then
    echo "Sage is already started."
    echo ""
    echo "Status: "
    echo "$tmp"
    return 0
  fi

  remote_cmd "screen -dmS sage_service bash -c 'cd ~/sage; ./start.sh'"

  local rt="succ"
  wait_sage_start || rt="fail"
  if [[ "$rt" == "succ" ]]; then
    echo "Sage is started."
  else
    echo "Sage fails to start."
  fi
}
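
One detail worth checking in a polling loop like the one above: the return value is derived from try_cnt, so if the test file arrives during the final sleep, the counter has already reached 0 and the function reports a timeout. A minimal sketch of a variant that decides on the file content itself is shown here. The name wait_for_content and its parameters are hypothetical, and the auntie call is left out to keep the example self-contained.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll until FILE contains exactly EXPECTED, giving up
# after TRIES checks spaced INTERVAL seconds apart.
# Returns 0 on a match and non-zero on timeout.
wait_for_content () {
  local file="$1" expected="$2" tries="$3" interval="$4"
  local i
  for (( i = 0; i < tries; i++ )); do
    if [[ -f "$file" && "$(cat "$file")" == "$expected" ]]; then
      return 0
    fi
    sleep "$interval"
  done
  # Final check, so a file that arrives during the last sleep still counts.
  [[ -f "$file" && "$(cat "$file")" == "$expected" ]]
}
```

A caller could then write wait_for_content "$test_file" "$content" 10 2 || rt="fail", which keeps the error path explicit.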

stop sage

Code:

stop_sage () {
  # stop sage
  remote_cmd "cd ~/sage; ./stop.sh" > /dev/null

  # kill all sage_service screens
  remote_cmd "
for session in \$(screen -ls | grep 'sage_service' | grep -o '[0-9]\+\.' | grep -o '[0-9]\+')
do
    screen -S \${session} -p 0 -X quit
done
"

  # check whether sage is really stopped.
  local tmp="$(status_sage)"
  if [[ "$tmp" == "Sage is stopped." ]]; then
    echo "Sage is stopped."
  else
    echo "Sage failed to stop."
    echo ""
    echo "Status: "
    echo "$tmp"
  fi
}

status sage

Code:

status_sage () {
  local output="$(remote_cmd "cd ~/sage; ./status.sh")"
  if [[ "$output" == "Sage is stopped." ]]; then
    echo "Sage is stopped."
  else
    echo "$output"
  fi
}

OK.

(09-07-2016 06:37 PM)YU_Xinjie Wrote:
(09-07-2016 06:18 PM)lingu Wrote: Please revise the design of this part first, as a reply; then I can reply to endorse the overall design.

The real "fix" should be after the design is endorsed.

Updated the code of "start sage" in the headpost.

Please take a look again.

Okayed the design.

But you ignored my words "as a reply". Please make sure to follow instructions precisely.

(09-07-2016 05:35 PM)lingu Wrote:
Quote:wait_sage_start () {
  content="This is test content."
  echo $content > $curdir/${USER}_sage_testfile
  test_file=/thinker/bin/ephemeral/667.sagetest

  # two minutes timeout
  try_cnt=40
  while [[ ("$try_cnt" -gt "0") && ((! -f "$test_file") || ("$(cat $test_file)" != "$content")) ]]; do
    timeout -s SIGINT 1 bash -c "sage_user=$sage_user $curdir/auntie ^$curdir/${USER}_sage_testfile $glad_portal 667 sagetest" || echo -n ''
    echo "Trying to start Sage..."
    sleep 2
    try_cnt=$((try_cnt-1))
  done
  rm -f $test_file
}

The wait is too long. If it does not work within 30 seconds, report an error and stop.

1. While implementing, I found that "sage start" takes 21 seconds on average in my dev cluster. The performance of my dev cluster is poor since the nodes are VMs.
Sometimes it takes more than 30 seconds. In that case Sage has actually started, but my script reports a failure.

Therefore I suggest we still use a 2-minute timeout here.

2. The sleep time is 2 seconds after each failed trial.
But I found that although the auntie command finishes quickly when it succeeds, the received file content only appears after 2-3 seconds. So even when auntie has sent the file successfully, another auntie trial may be triggered. If it is, the script prints an ERROR message that confuses the user, although Sage and the service scripts still succeed in the end.

Therefore I suggest we use "sleep 4" after each failed trial. Then the confusing ERROR message would appear less often.

I will implement the above changes before your review, since I think they are minor.
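
The relationship between the try count, the sleep time, and the overall timeout can be worked out as follows. This is a rough sketch that assumes the worst case, in which every auntie attempt uses its full 1-second timeout before the sleep; the helper name tries_for_budget is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: how many tries fit into BUDGET seconds when each try
# costs at most PER_TRY_TIMEOUT seconds plus SLEEP_SECS seconds of sleep.
tries_for_budget () {
  local budget="$1" per_try_timeout="$2" sleep_secs="$3"
  echo $(( budget / (per_try_timeout + sleep_secs) ))
}

tries_for_budget 120 1 2   # prints 40 (the two-minute setting with sleep 2)
tries_for_budget 30 1 2    # prints 10 (the 30-second setting with sleep 2)
tries_for_budget 120 1 4   # prints 24 (two minutes with sleep 4)
```

Under the same assumption, keeping try_cnt=40 while moving to "sleep 4" would stretch the worst-case wait to 40 * 5 = 200 seconds, so the try count would need to change together with the sleep time.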

For the private key, I encountered an issue:

The ssh command checks the permissions of the private key. If the key is readable by others, ssh simply ignores the key. One way to skip the check is to set the owner of the key to a user who would never use the key. For example:

Code:

[0][15:40:42] xinjie@devmac0e0:/thinker/dstore/gene/glad/conf
$ ll sage_key
-rw-r--r--. 1 dstore dstore 1679 Sep  8 14:57 sage_key

I set the owner to dstore. Then users gene, sage, and xinjie can use the key to access sage_portal. Only user dstore cannot.
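
A small sketch of the kind of permission test described above, useful for checking a key file before relying on it. It only inspects the mode bits with GNU stat and does not reproduce ssh's own logic; the function name key_perm_ok is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed only if FILE grants no access to group or others.
# Uses GNU stat (-c '%a' prints the octal mode, e.g. 600 or 644).
key_perm_ok () {
  local mode
  mode="$(stat -c '%a' "$1")" || return 2
  # Interpret the mode as octal and mask the group/other bits.
  (( (8#$mode & 8#077) == 0 ))
}
```

For the listing above (mode 644) this check would fail, which matches the case where ssh refuses a key that belongs to the calling user.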

Committed in df99edcb4f9ae617c098714f8036393fbdf8e413 .

(09-08-2016 04:48 PM)YU_Xinjie Wrote: For the private key, I encountered an issue:

The ssh command checks the permissions of the private key. If the key is readable by others, ssh simply ignores the key. One way to skip the check is to set the owner of the key to a user who would never use the key. For example:

Code:

[0][15:40:42] xinjie@devmac0e0:/thinker/dstore/gene/glad/conf
$ ll sage_key
-rw-r--r--. 1 dstore dstore 1679 Sep  8 14:57 sage_key

I set the owner to dstore. Then users gene, sage, and xinjie can use the key to access sage_portal. Only user dstore cannot.

If the user is not the owner but can read the file and other users can read the key file, too, why does SSH allow the user to use the key? This seems to be a bug in SSH.

Is the behavior dependable? If the current behavior is not well defined, we may not want to rely on it although it works now. In the future, if the behavior changes, our programs will break.

Can we use multiple key files of the same content, i.e., key-sage, key-xinjie, ..., so that the key file's access can be restricted to that particular user?
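
A minimal sketch of the per-user key idea, assuming one copy named key-<user> per operative user in a directory of our choosing. The function name install_user_key and the naming scheme are hypothetical; changing the owner requires root, so the sketch skips that step when run as a normal user.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create a private copy of SRC for USER in DEST_DIR,
# readable and writable by the owner only. Prints the path of the copy.
install_user_key () {
  local src="$1" dest_dir="$2" user="$3"
  local dest="$dest_dir/key-$user"
  # install copies the file and sets the mode in one step.
  install -m 600 "$src" "$dest" || return 1
  # Changing ownership needs root; without it the copy stays owned by the caller.
  if [[ "$(id -u)" -eq 0 ]]; then
    chown "$user" "$dest" || return 1
  fi
  echo "$dest"
}
```

Run as root during installation, a loop such as for u in gene sage xinjie; do install_user_key "$gb/conf/sage_key" "$gb/conf" "$u"; done would give each user a key file that only that user can read.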

In general, avoid hacks unless there is no other way out. Keep things simple and stupid. Hacks are fun, but they are not dependable. Once the fun is over, all that is left the next morning is a mess to clean up.

(09-07-2016 05:35 PM)lingu Wrote:
Quote:1.
Require that sage_user can log in to sage_user@sage_portal without a password.

This is fine if all are sage_user.

Quote:Copy the private key of sage_user@sage_portal into $gb/conf/sage_key.

$gb is owned by glad_user and all operative users can access it. So this is a security vulnerability. It is OK for now, but make a TODO to fix it later.

I think I may not have really understood what we are doing here. Are we trying to let glad_user ssh to sage_user@sage_portal? Or are we trying to let sage_user ssh back to sage_user@sage_portal?

For the latter, we can perhaps let all sage_user have the same key on all nodes? But the keys may not be stored in $gb.

For the former, we should perhaps add glad_user's pub key to sage_user@portal's authorized_keys during the installation of glad?
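
For the former case, a minimal sketch of an idempotent step that an installer could run as sage_user on sage_portal to register glad_user's public key. The function name add_authorized_key is hypothetical, the public key file is assumed to hold a single line, and this is not part of util/install.sh as it exists today.

```shell
#!/usr/bin/env bash
# Hypothetical install step: append the public key in PUBKEY_FILE to AUTH_FILE
# unless an identical line is already present.
add_authorized_key () {
  local pubkey_file="$1" auth_file="$2"
  local key
  key="$(cat "$pubkey_file")" || return 1
  mkdir -p "$(dirname "$auth_file")"
  touch "$auth_file"
  chmod 600 "$auth_file"
  # -x matches the whole line, -F treats the key as a fixed string.
  if ! grep -qxF -- "$key" "$auth_file"; then
    printf '%s\n' "$key" >> "$auth_file"
  fi
}
```

Where password login to sage_portal is still available, the standard ssh-copy-id tool serves a similar purpose. With either approach no private key has to be placed in $gb.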
