Service management of Sage - D
07-12-2016, 05:54 PM (This post was last modified: 11-23-2019 11:41 PM by lingu.)
Post: #1
Service management of Sage - D
The source of the sage program is at cod://sage/src/service/bin/sage

goal

1. The script can be run as sage_user@sage_portal to start, stop, and check the status of sage.
2. The script should not be coupled to GLAD.

The script will replace the current start.sh/stop.sh/status.sh.

sage main logic

common var import
Code:
modname=sage
base=<current dir>
LearnPcf /thinker/etc/soft/$modname/mic.pcf
LearnPcf $base/config.pcf
ips=$base/hosts.ips
timeout = (number of lines in $ips) / 2, or 3 if $ips does not exist

parse options
run command
  start
  stop
  status
  ps
  tasks
  help
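The timeout rule above can be sketched as a small shell function. This is only an illustration: the name compute_timeout is hypothetical, and rounding down for an odd number of lines is an assumption, since the design does not say how to round.

```shell
# Hypothetical helper (not part of the sage script): print the timeout.
# Half the number of lines in the hosts file, or 3 if the file is missing.
compute_timeout() {
    local ips_file=$1
    if [ -f "$ips_file" ]; then
        local lines
        # grep -c '' counts every line, including an unterminated last line.
        lines=$(grep -c '' "$ips_file")
        echo $(( lines / 2 ))    # assumption: integer division, rounds down
    else
        echo 3
    fi
}
```

For example, `timeout=$(compute_timeout "$base/hosts.ips")` would give 2 for a four-line hosts file.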

sage start

Start a screen session (sage_screen) that runs the start_sage script.
The start_sage script looks like this:
Code:
trap ctrl_c

on ctrl_c:
    echo "You typed Ctrl-C to terminate the program"
    kill DT
    clean all object files and exit

do_start() {
    ./detect_listen.sh &
    make run    # calls `dt run sage-svc $sage_base/config.pcf`
}

if the 1st parameter == 'nohup'; then
    nohup do_start > $sage_stdout 2>&1 &
else
    do_start 2>&1
fi

Use auntie/ps to check whether sage has started correctly.
If it has, the wake_sage script is executed in a screen. wake_sage periodically writes the tasks graph info to $sage_base/stdout/tasks_graph.tmp.<timestamp>.
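The control flow of start_sage (trap Ctrl-C, then start in the foreground or under nohup) could look roughly like the sketch below. It is not the real script: stop_dt and clean_objects are hypothetical stand-ins for "kill DT" and "clean all object files", and launch is an illustrative name.

```shell
# Hypothetical stand-ins for the design's "kill DT" and "clean all object files".
stop_dt() { echo "stopping DT"; }
clean_objects() { echo "cleaning object files"; }

# Handler for Ctrl-C (SIGINT): report, stop DT, clean up, exit.
on_ctrl_c() {
    echo "You typed Ctrl-C to terminate the program"
    stop_dt
    clean_objects
    exit 130    # 128 + 2, the conventional exit status after SIGINT
}

# Run a command in the foreground, or detached when the mode is "nohup".
# $sage_stdout is the log path named in the design.
launch() {
    local mode=$1
    shift
    trap on_ctrl_c INT
    if [ "$mode" = "nohup" ]; then
        nohup "$@" > "$sage_stdout" 2>&1 &
    else
        "$@" 2>&1
    fi
}
```

Note that nohup runs an external command, not a shell function, so in this sketch the start steps would have to live in a script of their own (for example `launch nohup ./do_start.sh`, where do_start.sh is a hypothetical file).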

sage stop

Use 'dt slay' to stop DT.
Kill the sage_screen.
Use ps to check whether sage is stopped correctly.
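Killing the screen sessions could be sketched as follows. The function names are hypothetical, and the parser assumes the usual `screen -ls` layout in which each session line starts with `<pid>.<name>`.

```shell
# Read `screen -ls` output on stdin and print the numeric IDs of the
# sessions whose name equals $1 (hypothetical helper).
screen_session_ids() {
    awk -v name="$1" '{
        split($1, parts, ".")
        # Keep lines whose first field is "<digits>.<name>".
        if (parts[1] ~ /^[0-9]+$/ && substr($1, length(parts[1]) + 2) == name)
            print parts[1]
    }'
}

# Ask every matching session to quit.
kill_sage_screens() {
    local id
    for id in $(screen -ls | screen_session_ids "$1"); do
        screen -S "$id" -X quit
    done
}
```

Usage would be `kill_sage_screens sage_service`, using whatever session name the start step gave the screen.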

sage status

Run "dexer auntie" to check the status of sage.

sage ps

Use "screen -ls" to get the sage_service screens info.
Use "dexer 'ps aux' | grep -E 'vpc|nrc|scheduler|mem_home|atpd|ncat|xfer|sgl|: ps aux'" to get the processes info.


sage tasks
sage tasks invokes function tasks_sage():
Code:
local tasks_graph_file=$base/stdout/tasks_graph.tmp
while True {
  while True {
    cnt = count ${tasks_graph_file}*;
    sleep 5 if cnt>1 || cnt==0;
    otherwise break;
  }
  call auntie ##To wake up sage to collect info in all nodes.
  check timestamp;
  cat $sage_base/stdout/tasks_graph.tmp.<timestamp>
  update timestamp;
}
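The inner loop of tasks_sage waits until exactly one tasks_graph.tmp.<timestamp> file exists. A minimal sketch of that wait, with hypothetical function names and the 5-second sleep as a default:

```shell
# Count the files whose names start with the given prefix.
count_prefixed() {
    set -- "$1"*    # expand the glob into this function's own arguments
    if [ -e "$1" ]; then echo "$#"; else echo 0; fi
}

# Wait until exactly one snapshot file exists, then print its path.
wait_for_single_snapshot() {
    local prefix=$1 interval=${2:-5}
    while [ "$(count_prefixed "$prefix")" -ne 1 ]; do
        sleep "$interval"
    done
    set -- "$prefix"*
    echo "$1"
}
```

The outer loop could then call `wait_for_single_snapshot "$base/stdout/tasks_graph.tmp"`, wake sage through auntie, and cat the file once its timestamp has changed.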

---------
20191123/lingu: move tasks out.
20190511/cwt: Add auntie call.
20190510/lingu: timeout default 3.
20190509/cwt: Add common var import. Add timeout, IP count etc.
20190508/lingu: add src location.
20160927/yxj: decouple sage service script with glad.
09-07-2016, 02:00 PM
Post: #2
RE: Service management of Sage
(07-12-2016 05:54 PM)YU_Xinjie Wrote:  This thread is the design of service management of Sage

The usage of service management is in
http://tab.d-thinker.org/showthread.php?tid=6164
http://tab.d-thinker.org/showthread.php?tid=6197

Format PPM

./format.sh
Code:
dt format -n NVPC k

Load metadata

./load.sh
Code:
set $cfgfile, $nodefile, and the search-file prefix ($prefix)
echo 'Greppy system is starting.....'
tr '\n' from the $nodefile

function get_basic_info:
    read vpcs, space, ip, slot, filename, and nf (the number of space-separated fields on the line) from $nodefile; all vars are arrays

function make_link:
    parameter: $file, the file to search
    if [ ! -f $file on $ip ]:
        echo "$file not on $ip" and exit
    else:
        link $file into the thinker stdin dir

function gain_config:
    printf $space, $filesize, $dejavusize into $cfgfile

totalline = number of lines in $nodefile

function gen_config:
    mv vpc.$num and nrcs.$num to the thinker runtime dir

for i in range(0, totalline):
    get_basic_info()
    $host[i] = $ip
    if [ $nf -eq 2 ]:  ## the line names a file to search
        make_link
        gain_config
    else:
        print $space, 0, 0, 'n' to $cfgfile

gen_config()
call make datacfg to run the size prog
call make read to run the read prog

Start service

./start.sh
Code:
trap ctrl_c

on ctrl_c:
    echo "You typed Ctrl-C to terminate the program"
    kill DT
    clean all object files and exit

function do_start() {
    make run
}

if the 1st parameter == 'nohup'; then
    nohup do_start &
else
    do_start
fi

Stop service

./stop.sh
Code:
dt slay -n NVPC k

Show status

./status.sh
Code:
dexer --ips=$ips "ps aux" | grep -E 'vpc|nrc|scheduler|mem_home|atpd|auntie|ncat' | grep 'sage' | grep -v 'dexer' | grep -v 'grep'

Saved a copy before change.
09-07-2016, 02:47 PM
Post: #3
RE: Service management of Sage
@lingu

Please review the design, so that I can implement it.
09-07-2016, 03:53 PM
Post: #4
RE: Service management of Sage
It seems to be quite complex. Do we really need to pay for that complexity? If so, please let me know and I'll review the design in detail.

Here is what I think we can do:

1. Start/stop sage using the current way (start.sh and kill).

2. An operative user may not be able to run start.sh and kill as the user 'sage'. Hence, there should be some mechanism that lets them do so. For example, can we create a user 'gladsage' whose login shell is a command that uses a key file to log in as sage and run start.sh or kill?
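One way to picture the 'gladsage' idea is a restricted entry point that maps a requested action to one fixed command and rejects everything else. The sketch below is only an assumption about how such a wrapper might look: the function name is hypothetical, and mapping stop to stop.sh (rather than a raw kill) is an illustrative choice.

```shell
# Hypothetical whitelist for a restricted 'gladsage' account:
# print the one command allowed for the requested action, or fail.
sage_action_command() {
    case "$1" in
        start)  echo "./start.sh" ;;
        stop)   echo "./stop.sh" ;;
        status) echo "./status.sh" ;;
        *)      echo "gladsage: action not allowed: $1" >&2
                return 1 ;;
    esac
}
```

If this were wired up as an OpenSSH forced command rather than a login shell, the requested action would arrive in SSH_ORIGINAL_COMMAND. Either way, the operative user never handles the sage key directly.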
09-07-2016, 03:55 PM
Post: #5
RE: Service management of Sage
We still need to handle the race condition of running multiple sage thinkers. But I think thinkers already have such a conflict-reporting capability. We just need to track the return status of start.sh and, when errors occur, report an error and stop glad.

If start.sh does not report the return status correctly, we may need to improve start.sh.
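Tracking the return status could be as small as the sketch below. The names run_or_stop and stop_glad are hypothetical, and it assumes start.sh exits non-zero on failure, which the post above notes may not yet be true.

```shell
# Run a command; if it fails, report the status and call a stop hook.
# Returns the command's own exit status.
run_or_stop() {
    local stop_hook=$1
    shift
    "$@"
    local rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "error: '$*' exited with status $rc" >&2
        "$stop_hook"
    fi
    return "$rc"
}
```

A caller would define its own hook and run, for example, `run_or_stop stop_glad ./start.sh`.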
09-07-2016, 03:58 PM
Post: #6
RE: Service management of Sage
(07-12-2016 05:54 PM)YU_Xinjie Wrote:  The brief idea is to use auntie to send a file to glad_portal. Then check the file content.

Directly running auntie without arguments can already check if sage is present.

Quote:1.
require sage_user can password-less login sage_user@sage_portal
copy private key of sage_user@sage_portal into $gb/conf/sage_key

Then glad operative user & sage_user can use sage_key to execute remote command to sage_user@sage_portal.

This is a key design part. I agree with your design as a quick implementation.

But it has a security hazard -- if an op user can access the key, he/she can run anything as sage. Hence, please consider the 'gladsage' solution I wrote about when you have time to refine the solution.
09-07-2016, 04:11 PM
Post: #7
RE: Service management of Sage
(09-07-2016 03:53 PM)lingu Wrote:  It seems to be quite complex. Do we really need to pay for that complexity? If so, please let me know and I'll review the design in detail.

The code I posted is actually the script I developed for shbio. It has worked well for more than a week.
It is complex but robust.

I hope you can check it, so that I can put it into our glad/sage code tree.
You do not need to check every option of every command, because they work. But please check the overall approach. In brief, the commands are just remote-command wrappers for start.sh/stop.sh/status.sh, except for the function wait_sage_start, which uses auntie to check sage.

(09-07-2016 03:55 PM)lingu Wrote:  We still need to handle the race condition of running multiple sage thinkers. But I think thinkers already have such a conflict-reporting capability. We just need to track the return status of start.sh and, when errors occur, report an error and stop glad.

If start.sh does not report the return status correctly, we may need to improve start.sh.

Good point. I will check it.

(09-07-2016 03:58 PM)lingu Wrote:  
(07-12-2016 05:54 PM)YU_Xinjie Wrote:  The brief idea is to use auntie to send a file to glad_portal. Then check the file content.

Directly running auntie without arguments can already check if sage is present.

It does not work. I am already using the latest auntie. I reported it in this thread: http://tab.d-thinker.org/showthread.php?tid=6506
09-07-2016, 05:06 PM (This post was last modified: 09-07-2016 05:07 PM by lingu.)
Post: #8
RE: Service management of Sage
(09-07-2016 04:11 PM)YU_Xinjie Wrote:  
(09-07-2016 03:53 PM)lingu Wrote:  It seems to be quite complex. Do we really need to pay for that complexity? If so, please let me know and I'll review the design in detail.

The code I posted is actually the script I developed for shbio. It has worked well for more than a week.
It is complex but robust.

I hope you can check it, so that I can put it into our glad/sage code tree.
You do not need to check every option of every command, because they work. But please check the overall approach. In brief, the commands are just remote-command wrappers for start.sh/stop.sh/status.sh, except for the function wait_sage_start, which uses auntie to check sage.

I will review. But this is TERRIBLE practice. Please make sure to get endorsement from another engineer before starting to implement an important function. If you ask me for an important review and I don't respond, please remind me.

If you need to spend half a day implementing some code, you should be cautious about writing the code directly. Instead, invite reviews. We may find a way to solve the problem in 2 hours.

If there is no response after reminders, you may decide to go ahead with implementing an important piece of code.

If it takes only 30min to implement something, you can go ahead and implement it without endorsement if you don't want to wait. If the change is insignificant, you may also skip the review.

It is generally not practical to define what is "important" or "significant". We rely on engineers' judgement.

This one -- how to start/stop sage and make it work with glad -- is certainly an important piece of design.
09-07-2016, 05:18 PM
Post: #9
RE: Service management of Sage
(09-07-2016 05:06 PM)lingu Wrote:  I will review. But this is TERRIBLE practice. Please make sure to get endorsement from another engineer before starting to implement an important function. If you ask me for an important review and I don't respond, please remind me.

If you need to spend half a day implementing some code, you should be cautious about writing the code directly. Instead, invite reviews. We may find a way to solve the problem in 2 hours.

If there is no response after reminders, you may decide to go ahead with implementing an important piece of code.

If it takes only 30min to implement something, you can go ahead and implement it without endorsement if you don't want to wait. If the change is insignificant, you may also skip the review.

It is generally not practical to define what is "important" or "significant". We rely on engineers' judgement.

This one -- how to start/stop sage and make it work with glad -- is certainly an important piece of design.

Okay. I agree with these rules and will follow them.
09-07-2016, 05:35 PM
Post: #10
RE: Service management of Sage
(07-12-2016 05:54 PM)YU_Xinjie Wrote:  goal

1. the script can be used in both sage_portal & glad_portal to start/stop/status sage.
2. user can start sage before glad, and stop sage after glad.

Sage needs to be started automatically by glad for a batch of samples (dataflows).

It is okay to have a user start it manually at present. But this is not what we want for the future. It adds a human operation. An extra human operation makes the software a bit less easy to use and makes mistakes a whole lot more likely.

So, when designing, you can make compromises now, but keep in mind what we want in the long run.

Quote:The script would replace the current start.sh/stop.sh/status.sh in the future.
The script may inspire us about how to set up regression tests for sage.

This is good thinking -- using one way, not two ways, to do one task.

Quote:[quote]
For shbio's project, I have already written a script /thinker/dstore/run/sage to support a part of the goal. I would try to generalize it.

config design

The brief idea is to use auntie to send a file to glad_portal. Then check the file content.

This is too heavyweight -- it may even disrupt Sage if it is busy.

Sending a file takes 2-3s. This is considered long/heavyweight.

OK to keep the current design if it has worked. But please be aware of latency in a pathetic way -- otherwise our software would be as slow as Hadoop in 2 years.

Quote:1.
require sage_user can password-less login sage_user@sage_portal

This is fine if all are sage_user.

Quote:copy private key of sage_user@sage_portal into $gb/conf/sage_key

$gb is owned by glad_user and all operative users can access it. So this is a security vulnerability. OK now. But make a TODO to fix it later.

Quote:Then glad operative user & sage_user can use sage_key to execute remote command to sage_user@sage_portal.

2. The glad_portal IP can be obtained from $gb/conf/config.sh

How (a convar like glad_portal)? Have we made sure such a variable exists in $gb/conf/config.sh? If not, please add a note there, "needs implementing", so that we know to create a TODO item and implement this later, after this design is endorsed.

Quote:3.
create a config file $gb/conf/sage_portal_ip to record the IP of sage_portal. Or we can also write it into $gb/conf/config.sh.

I prefer to create a convar like glad_sage_user and store it in $gb/conf/config.sh

Quote:4. let sage_user@sage_portal can password-less login glad_user@glad_portal.

That's fine -- we assume sage_user is not a human operator and can be a powerful guy.

Quote:start sage

Code:
remote_cmd () {
ssh -q -o StrictHostKeyChecking=no -i $gb/conf/sage_key $sage_user@$sage_portal "$1"
}
[/quote]

Don't have to specify this -- everybody can find a way to run a remote program without being tremendously wrong. So it's not a key design detail.

We require engineers to use pseudocode to write down the design if it involves some logic. But code is not appropriate pseudocode.

[quote]
wait_sage_start () {
content="This is test content."
echo $content > $curdir/${USER}_sage_testfile
test_file=/thinker/bin/ephemeral/667.sagetest

# two minutes timeout
try_cnt=40
while [[ ("$try_cnt" -gt "0") && ((! -f "$test_file") || ("$(cat $test_file)" != "$content")) ]]; do
  timeout -s SIGINT 1 bash -c "sage_user=$sage_user $curdir/auntie ^$curdir/${USER}_sage_testfile $glad_portal 667 sagetest" || echo -n ''
  echo "Trying to start Sage..."
  sleep 2
  try_cnt=$((try_cnt-1))
done
rm -f $test_file
}
[/quote]

Too long wait. If it does not work in 30 sec, report an error and stop.

[quote]
start_sage () {
# check whether sage is really stopped.
local tmp="$(status_sage)"
if [[ "$tmp" != "Sage is stopped." ]]; then
  echo "Sage is already started."
  echo ""
  echo "Status: "
  echo "$tmp"
  return 0
fi

remote_cmd "screen -dmS sage_service bash -c 'cd ~/sage; ./start.sh'"
wait_sage_start
echo "Sage is started."
}

What if wait_sage_start fails?

We need a way to check that Sage has started. I thought this was what you were doing. But it looks like, if auntie continues to fail, the wait still completes without halting on an error.
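A wait that actually fails when the deadline passes could be sketched like this. The name wait_until is hypothetical, and the elapsed time is approximated by summing the sleep intervals, so time spent inside the check itself is not counted.

```shell
# Poll a check command until it succeeds or the deadline passes.
# Usage: wait_until <deadline_seconds> <interval_seconds> <check command...>
wait_until() {
    local deadline_s=$1 interval_s=$2
    shift 2
    local waited=0
    while ! "$@"; do
        if [ "$waited" -ge "$deadline_s" ]; then
            echo "error: check did not succeed within ${deadline_s}s" >&2
            return 1
        fi
        sleep "$interval_s"
        waited=$(( waited + interval_s ))
    done
}
```

start_sage could then run something like `wait_until 30 2 sage_is_up || return 1` before printing "Sage is started.", where sage_is_up is a hypothetical check built on auntie.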

Quote:stop sage

Code:
stop_sage () {
# stop sage
remote_cmd "cd ~/sage; ./stop.sh" > /dev/null

# kill all sage_service screens
remote_cmd "
for session in \$(screen -ls | grep 'sage_service' | grep -o '[0-9]\+\.' | grep -o '[0-9]\+')
do
    screen -S \${session} -p 0 -X quit
done
"
[/quote]

Looks good.

[quote]
# check whether sage is really stopped.
local tmp="$(status_sage)"
if [[ "$tmp" == "Sage is stopped." ]]; then
  echo "Sage is stopped."
else
  echo "Sage failed to stop."
  echo ""
  echo "Status: "
  echo "$tmp"
fi
}

OK.

Quote:status sage

Code:
status_sage () {
local output="$(remote_cmd "cd ~/sage; ./status.sh")"
if [[ "$output" == "Sage is stopped." ]]; then
  echo "Sage is stopped."
else
  echo "$output"
fi
}

OK. In the future, we should consider using auntie to directly query Sage to get the status.