08-08-2021, 12:35 PM
zhihao
08-08-2021, 12:38 PM
I started bibo on limbo305-1 and sage on limbo305-3, then ran the delete test, but it failed.
Code:
[root@limbo305-1 test]# ./test_insert_zeng.sh
sending insert request ./processed/zengxingliang-newkey.json
return code: 200
insert success
[root@limbo305-1 test]# ./test_delete_zeng.sh
sending delete request /thinker/local/soft/bibo/app/test/processed/del_zengxingliang.json to localhost:62818
delete FAILED
[root@limbo305-1 test]#
The log reports:
Code:
[root@limbo305-1 sage]# less procone-49.log
[2021-08-08 13:50:33] procone starts
tycano: 49
more args
keys in /thinker/globe/soft/bibo/procuratorate/tests/typical/49/insert.json
There is error when opening file /thinker/globe/soft/bibo/procuratorate/tests/typical/49/insert.json
procone-49.log (END)
Then I updated toneroot in procone.py to /thinker/globe/soft/bibo/procuratorate/cases, and the delete test now reports that it can't open a file:
Code:
[2021-08-08 14:11:31] procone starts
tycano: 51
more args
keys in /thinker/globe/soft/bibo/procuratorate/cases/typical/51/content
keys ready
tyca json loaded.
opening intermediate file /thinker/globe/soft/bibo/procuratorate/cases/res2/res_51.txt-tmp
There was no res2 dir. I created it manually, and then case 52 could be deleted:
Code:
[sage@limbo305-1 test]$ /thinker/local/soft/bibo/plug/procone.py --tycano 52 --reqtype delete --reqcaseid "67be261c9e6311eabceb005056c00001"
[sage@limbo305-1 test]$ /
Code:
[root@limbo305-1 test]# cat /thinker/local/soft/bibo/app/test/processed/del_zengxingliang.json; sleep 18 | ncat localhost 62818
{"type":"delete","caseid":"67be261c9e6311eabceb005056c00001"
}
[root@limbo305-1 test]# echo $?
0
Later I added the following to procone.py:
Code:
toneroot = "/thinker/globe/soft/bibo/procuratorate/cases"
bibofast = "/thinker/fastdata/bibo"
bibe = bibofast + "/e"
reslocal = bibe + "/res2"
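The missing res2 directory above had to be created by hand. A minimal sketch of how the script could create it at startup instead, assuming the same path layout as the snippet above. `ensure_dir` and `build_paths` are hypothetical helpers, not names from procone.py, and the syntax is Python 3:

```python
import os

def ensure_dir(path):
    """Create path (and any missing parents) if absent; return the path."""
    os.makedirs(path, exist_ok=True)  # no error if it already exists
    return path

def build_paths(bibofast):
    """Derive bibe and reslocal the same way as the procone.py snippet."""
    bibe = os.path.join(bibofast, "e")
    reslocal = os.path.join(bibe, "res2")
    return bibe, reslocal

if __name__ == "__main__":
    # "/thinker/fastdata/bibo" is the bibofast value from the post above.
    bibe, reslocal = build_paths("/thinker/fastdata/bibo")
    print("result dir would be", reslocal)
    # ensure_dir(reslocal)  # uncomment on a host where the path is writable
```

Whether the directory should be created by the script or by the installer is a design choice; this only shows the first option.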
zhihao
08-08-2021, 07:56 PM
inquire gets stuck because yfs has some inactive PGs.
Code:
[root@limbo305-1 test]# ./test_inquire_zeng.sh
sending inquire request ./processed/zengxingliang-inquire.json to localhost:62818
^C
[root@limbo305-1 test]# ls /thinker/globe/soft/bibo/procuratorate/cases/sum2/
^C
[root@limbo305-1 test]#
[root@limbo305-1 test]# ceph -s
cluster:
id: 33d940a8-7e68-44f3-bc37-305aaaabbbbc
health: HEALTH_ERR
1 clients failing to respond to capability release
1 MDSs report slow metadata IOs
1 MDSs report slow requests
mons limbo305-1,limbo305-2,limbo305-3 are low on available space
1 monitors have not enabled msgr2
2/510 objects unfound (0.392%)
Reduced data availability: 36 pgs inactive, 33 pgs incomplete
Possible data damage: 1 pg recovery_unfound
Degraded data redundancy: 6/1530 objects degraded (0.392%), 1 pg degraded
32 pgs not deep-scrubbed in time
32 pgs not scrubbed in time
195 slow ops, oldest one blocked for 81807 sec, daemons [osd.5,osd.7] have slow ops.
services:
mon: 3 daemons, quorum limbo305-1,limbo305-2,limbo305-3 (age 23h)
mgr: limbo305-1(active, since 32m)
mds: cephfs:1 {0=limbo305-3=up:active} 1 up:standby
osd: 12 osds: 12 up (since 22h), 12 in (since 22h)
task status:
scrub status:
mds.limbo305-3: idle
data:
pools: 4 pools, 97 pgs
objects: 510 objects, 182 MiB
usage: 13 GiB used, 1.1 TiB / 1.1 TiB avail
pgs: 3.093% pgs unknown
34.021% pgs not active
6/1530 objects degraded (0.392%)
2/510 objects unfound (0.392%)
60 active+clean
32 creating+incomplete
3 unknown
1 active+recovery_unfound+degraded
1 incomplete
The 3 unknown PGs are in cephfs_metadata, so I then reinstalled yotta on limbo305-1.
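For reference, a Ceph PG id has the form `<pool_id>.<hex_seq>`, so stuck PGs can be grouped by pool once the pool ids are known (for example from `ceph osd lspools` and `ceph pg dump_stuck inactive`). A small sketch of that grouping; the pool table and PG ids below are made-up sample data, not values from this cluster:

```python
from collections import Counter

def pool_of(pgid, pools):
    """Return the pool name for a PG id of the form '<pool_id>.<hex_seq>'."""
    pool_id = int(pgid.split(".", 1)[0])
    return pools.get(pool_id, "pool-%d" % pool_id)

def count_by_pool(pgids, pools):
    """Tally PG ids by the pool they belong to."""
    return Counter(pool_of(p, pools) for p in pgids)

if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    pools = {1: "cephfs_data", 2: "cephfs_metadata"}
    unknown_pgs = ["2.3", "2.a", "2.1f"]
    for name, n in count_by_pool(unknown_pgs, pools).items():
        print(name, n)
```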
zhihao
08-10-2021, 12:50 AM
Installing sage gets stuck.
Code:
[root@limbo305-3 ~]# decent_init=True /thinker/local/forest/util/utilib/installx sage
/thinker/local/shed/installation/limbo305-tc
Installing /thinker/local/shed/installation/limbo305-tc/sage.tar.gz ...
The log shows that some files were not found:
Code:
...
10.36.3.51:3124 (slot 19)
Start to run the program with 72 VPCs.
Program execution completed.
protocol error: filename does not match request
protocol error: filename does not match request
protocol error: filename does not match request
cat: '/home/sage/think/run/results/result-72-sage/stdout-*': No such file or directory
cat: '/home/sage/think/run/results/result-72-sage/stdout-*': No such file or directory
...
=== Begin Sage Service Test Set ===
./sage_service/sage-service-start.sh
cat: /home/sage/sage/run/service.pid: No such file or directory
cat: /home/sage/sage/run/aide.pid: No such file or directory
Sage is stopped.
lockstep mark /thinker/globe/.think/lockstep//limbo305-3/sage/tested
lockstep mark /thinker/globe/.think/lockstep//limbo305-3/sage/decent/workdone
sage start reports that Sage needs to be configured first:
Code:
[sage@limbo305-3 ~]$ ./sage/bin/sage start
ERROR: Sage has not been configured. Please run /home/sage/sage/bin/configure at sage_portal first.
[sage@limbo305-3 ~]$
[sage@limbo305-3 correctness]$ /home/sage/sage/bin/configure
====== Generating config.pcf ======
inferred and exported:
sage_base: /home/sage/sage
sage_stdout: /home/sage/sage/stdout/sage.out
sage_debug: 0
helper_portal: 10.36.1.49
sage_ipx: /thinker/etc/ips.cfg
sage_atp_cnt: 60
generated /home/sage/sage/config.pcf
====== Generating config.node, sage-svc.inc & hosts.ips ======
-- generating config.node & sage-svc.inc by gen_sage_config
10.36.1.49
10.36.2.50
10.36.3.51
HELPER_CNT: 6
HEAVEN_CNT: 1
ATP_CNT: 60
backup old config.node: /home/sage/sage/config.node.105847
backup old sage-svc.inc: /home/sage/sage/sage-svc.inc.105847
Generating walkers from /thinker/etc/ips.cfg
New node: 10.36.1.49
New node: 10.36.2.50
New node: 10.36.3.51
Add fillers
10.36.1.49:1
10.36.1.49:51
10.36.1.49:2
10.36.1.49:51
10.36.1.49:3
10.36.1.49:51
10.36.1.49:4
...
10.36.1.49:51
Adding sage_atp_staff_space_cnt (3), sage_atp_staff_space_start (20) & sage_atp_head_space (14) to config.pcf
Arguments (Part 2):
WALKER_CNT: 3
FILLER_CNT: -1
sage_atp_staff_space_start: 20
sage_atp_staff_space_cnt: 3
sage_atp_head_space: 14
config.node is generated in /home/sage/sage/config.node successfully.
The old config.node is backuped in /home/sage/sage/config.node.105847.
sage-svc.inc is generated in /home/sage/sage/sage-svc.inc successfully.
The old svc_inc is backuped in /home/sage/sage/sage-svc.inc.105847.
-- generating hosts.ips from config.node
====== Begin to do the original load.sh ======
Parsing config.node
Configuring for 72 containers
10.36.3.51 slots: 52
10.36.1.49 slots: 54 55 56 57 58 59 60 52
10.36.2.50 slots: 52
10.36.3.51 slots: 53
10.36.1.49 slots: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
10.36.2.50 slots: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
10.36.3.51 slots: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
10.36.1.49 slots: 51 50Obtaining config info
Generating config
Sage is going to format the persistent memory of DT after 10 seconds. Please type Ctrl-C if you do not want to do it.
WARNING: Seems the thinker is running. Refused to format the persistent memory.
This may happen because:
a) Someone else is using the thinker.
b) Your program fails or you canceled you program by 'Ctrl-C'.
If you are the one who reserve the thinker for the current time range,
or you are sure that the thinker is running your program, you can
kill the thinker by:
$ dt slay
[sage@limbo305-3 correctness]$ dt slay
...
[sage@limbo305-3 correctness]$ /home/sage/sage/bin/configure
...
10.36.3.51:3116 (slot 17)
10.36.3.51:3120 (slot 18)
10.36.3.51:3124 (slot 19)
Start to run the program with 72 VPCs.
Program execution completed.
protocol error: filename does not match request
protocol error: filename does not match request
protocol error: filename does not match request
cat: '/home/sage/think/run/results/result-72-sage/stdout-*': No such file or directory
cat: '/home/sage/think/run/results/result-72-sage/stdout-*': No such file or directory
====== configure is done ====== [0810-11:17:45]
[sage@limbo305-3 correctness]$ echo $?
0
[sage@limbo305-3 correctness]$
sage start then reports that Sage is irresponsive:
Code:
[sage@limbo305-3 correctness]$ ~/sage/bin/sage start
Sage is irresponsive. Trying to stop it before start it again.
cat: /home/sage/sage/run/aide.pid: No such file or directory
The prior service instance is perhaps 267779
Sage is stopped.
Starting Sage ..............
The log shows:
Code:
10.36.1.49 "killall wake_sage detect_listen auntie atpd atpa 2> /dev/null":
10.36.2.50 "killall wake_sage detect_listen auntie atpd atpa 2> /dev/null":
10.36.3.51 "killall wake_sage detect_listen auntie atpd atpa 2> /dev/null":
Sage is stopped.
sage is to run nohup bash -c 'cd /home/sage/sage/bin; ./start_sage 2>&1 | tee -a /home/sage/sage/stdout/sage.out'
stty: 'standard input': Inappropriate ioctl for device
store sage_service pid 271612
start_sage: Tue Aug 10 11:21:10 CST 2021: Sage starts
detect_listen.sh: no process found
[Tue Aug 10 11:21:13 CST 2021] status sage 10.36.1.49 7
sage state:Sage is stopped or irresponsive. vs. started state:Sage is started.
[Tue Aug 10 11:21:20 CST 2021] status sage 10.36.1.49 7
sage state:Sage is stopped or irresponsive. vs. started state:Sage is started.
08-13-2021, 09:30 AM
(08-08-2021 12:38 PM)zhihao Wrote: I start bibo on limbo305-1 and start sage on limbo305-3, then I test delete but failed
Code:
[root@limbo305-1 test]# ./test_insert_zeng.sh
sending insert request ./processed/zengxingliang-newkey.json
return code: 200
insert success
[root@limbo305-1 test]# ./test_delete_zeng.sh
sending delete request /thinker/local/soft/bibo/app/test/processed/del_zengxingliang.json to localhost:62818
delete FAILED
[root@limbo305-1 test]#
we should use the standardized way -- 'make test'.
i added a cases var so that we can specify which test to run.
Code:
[sage@limbo305-1 test]$ make test cases=delete_zeng
Testing delete_zeng
run test ./test_delete_zeng.sh
sending delete request /thinker/local/soft/bibo/app/test/processed/del_zengxingliang.json to localhost:62818
Quote:then I update toneroot of procone.py to /thinker/globe/soft/bibo/procuratorate/cases , delete test reports can't open file
you should update the module if you do this on limbo305. copying files is okay on wp289, but it is generally not recommended.
08-13-2021, 09:37 AM
insert is not very stable.
Code:
[sage@limbo305-1 test]$ make test cases=insert_zeng
Testing insert_zeng
run test ./test_insert_zeng.sh
sending insert request ./processed/zengxingliang-newkey.json
return code: 416
insert failed
make: *** [Makefile:22: test] Error 93
[sage@limbo305-1 test]$ make test cases=insert_zeng
Testing insert_zeng
run test ./test_insert_zeng.sh
sending insert request ./processed/zengxingliang-newkey.json
return code: 200
insert success
[sage@limbo305-1 test]$
zhihao pls investigate this issue.
08-13-2021, 09:39 AM
delete fails
Code:
[sage@limbo305-1 test]$ make test cases=delete_zeng
Testing delete_zeng
run test ./test_delete_zeng.sh
sending delete request /thinker/local/soft/bibo/app/test/processed/del_zengxingliang.json to localhost:62818
delete FAILED
make: *** [Makefile:22: test] Error 255
[sage@limbo305-1 test]$
and sage crashes.
Code:
10.36.3.51:3112 (slot 16)
10.36.3.51:3116 (slot 17)
10.36.3.51:3120 (slot 18)
10.36.3.51:3124 (slot 19)
10.36.3.51:3128 (slot 20)
Start to run the program with 82 VPCs.
Greppy service is running, You can use client tools for searching now
ERROR: VPC reports ABORT_ERROR
ERROR: VPC reports ABORT_ERROR
ERROR: VPC 0x24023204 (10.36.2.50:3064) reports ABORT_ERROR:
Panic: addr out of bound.
ERROR: VPC 0x2402320e (10.36.2.50:3104) reports ABORT_ERROR:
Panic: addr out of bound.
[1628815250.887261s]
ERROR: VPC 0x2402320e (10.36.2.50:3104) reports ABORT_ERROR:
Panic: addr out of bound.
[1628815250.887309s] ERROR: VPC reports ABORT_ERROR
ERROR: VPC reports ABORT_ERROR
ERROR: VPC 0x24023208 (10.36.2.50:3080) reports ABORT_ERROR:
Panic: addr out of bound.
[1628815250.929026s]
ERROR: VPC 0x24023208 (10.36.2.50:3080) reports ABORT_ERROR:
Panic: addr out of bound.
[1628815250.929104s] ERROR: VPC reports ABORT_ERROR
ERROR: VPC reports ABORT_ERROR
ERROR: VPC 0x2402320d (10.36.2.50:3100) reports ABORT_ERROR:
Panic: addr out of bound.
[1628815250.935846s]
ERROR: VPC 0x2402320d (10.36.2.50:3100) reports ABORT_ERROR:
Panic: addr out of bound.
[1628815250.935908s] ERROR: VPC reports ABORT_ERROR
ERROR: VPC reports ABORT_ERROR
ERROR: VPC 0x24023201 (10.36.2.50:3052) reports ABORT_ERROR:
Panic: addr out of bound.
i changed sshd_config, rebooted the vm and mounted yfs manually. then sage does not crash but delete still fails.
Code:
[sage@limbo305-1 test]$ make test cases=delete_zeng
Testing delete_zeng
run test ./test_delete_zeng.sh
sending delete request /thinker/local/soft/bibo/app/test/processed/del_zengxingliang.json to localhost:62818
delete FAILED
make: *** [Makefile:22: test] Error 255
[sage@limbo305-1 test]$
atpa log shows
Code:
[root@limbo305-2 sage]# pwd
/thinker/local/today/users/sage
[root@limbo305-2 sage]# tail atpa.log-10
2021.08.13 9:0:1 (ffffffa2): atpa.runx creating ..- /thinker/fastdata/bibo/e/res2//0.. 0x0 0xa 0x1
2021.08.13 9:0:1 (ffffffa2): atpa.runx creating ..- /thinker/fastdata/bibo/e/res2//0/10.. 0x0 0xa 0x1
2021.08.13 9:0:1 (ffffffa2): atpa::runx() kissing : 10
2021.08.13 9:0:1 (ffffffa2): atpd::KissFromSage: k i len, st_size: 10 6693
2021.08.13 9:0:1 (ffffffa2): atpd::KissFromSage: Get ret : 10 1
2021.08.13 9:0:1 (ffffffa2): KissFromSage: varda5 format, len..- {"typ.. 0x1 0x1a25 0x0
2021.08.13 9:0:1 (ffffffa2): creating dir /thinker/fastdata/bibo/e/res2//0
2021.08.13 9:0:1 (ffffffa2): mkdir ..- /thinker/fastdata/bibo/e/res2//0.. 0xffffffffffffffff 0x0 0x0
2021.08.13 9:0:1 (ffffffa2): creating dir /thinker/fastdata/bibo/e/res2//0/10
2021.08.13 9:0:1 (ffffffa2): mkdir returns - ffffffffffffffff 0 0 0
[root@limbo305-2 sage]#
efs dir is not installed on limbo305-2
Code:
[root@limbo305-2 sage]# dexer 'ls /thinker/bin/ephemeral/'
10.36.1.49: ls /thinker/bin/ephemeral/
bibo
10.36.2.50: ls /thinker/bin/ephemeral/
10.36.3.51: ls /thinker/bin/ephemeral/
bibo
[root@limbo305-2 sage]#
i re-ran startbibo, and the dir is created.
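A check like the dexer listing above can be automated. The sketch below parses output in the shape shown in that transcript (a `host: command` header line followed by that host's listing) and reports hosts where an expected entry is missing. The output format is inferred from this one transcript only, and `hosts_missing` is a hypothetical helper:

```python
import re

# Header lines look like "10.36.1.49: ls /thinker/bin/ephemeral/".
HEADER = re.compile(r"^(\d+\.\d+\.\d+\.\d+): ")

def hosts_missing(output, entry):
    """Return the hosts whose listing does not contain entry."""
    listing = {}
    current = None
    for line in output.splitlines():
        m = HEADER.match(line)
        if m:
            current = m.group(1)
            listing[current] = []
        elif current is not None and line.strip():
            listing[current].append(line.strip())
    return [host for host, names in listing.items() if entry not in names]

# Sample copied from the transcript above.
sample = """10.36.1.49: ls /thinker/bin/ephemeral/
bibo
10.36.2.50: ls /thinker/bin/ephemeral/
10.36.3.51: ls /thinker/bin/ephemeral/
bibo
"""

if __name__ == "__main__":
    print(hosts_missing(sample, "bibo"))
```

On the sample it reports only 10.36.2.50, matching what the listing shows by eye.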
zhihao
08-13-2021, 11:30 AM
(08-13-2021 09:37 AM)lingu Wrote: insert is not very stable.
Code:
[sage@limbo305-1 test]$ make test cases=insert_zeng
Testing insert_zeng
run test ./test_insert_zeng.sh
sending insert request ./processed/zengxingliang-newkey.json
return code: 416
insert failed
make: *** [Makefile:22: test] Error 93
[sage@limbo305-1 test]$ make test cases=insert_zeng
Testing insert_zeng
run test ./test_insert_zeng.sh
sending insert request ./processed/zengxingliang-newkey.json
return code: 200
insert success
[sage@limbo305-1 test]$
zhihao pls investigate this issue.
OK, but I have not gotten the wrong return code yet.
Code:
[root@limbo305-1 test]# make test cases=insert_zeng
Testing insert_zeng
run test ./test_insert_zeng.sh
sending insert request ./processed/zengxingliang-newkey.json
return code: 200
insert success
[root@limbo305-1 test]# make test cases=insert_zeng
Testing insert_zeng
run test ./test_insert_zeng.sh
sending insert request ./processed/zengxingliang-newkey.json
return code: 200
insert success
[root@limbo305-1 test]# ./test_insert_zeng.sh
sending insert request ./processed/zengxingliang-newkey.json
return code: 200
insert success
[root@limbo305-1 test]# ./test_insert_zeng.sh
sending insert request ./processed/zengxingliang-newkey.json
return code: 200
insert success
[root@limbo305-1 test]# ./test_insert_zeng.sh
sending insert request ./processed/zengxingliang-newkey.json
return code: 200
insert success
[root@limbo305-1 test]# ./test_insert_zeng.sh
sending insert request ./processed/zengxingliang-newkey.json
return code: 200
insert success
[root@limbo305-1 test]#
[root@limbo305-1 test]# ./test_insert_zeng.sh
sending insert request ./processed/zengxingliang-newkey.json
return code: 200
insert success
[root@limbo305-1 test]#
08-13-2021, 11:33 AM
(08-13-2021 11:30 AM)zhihao Wrote: ok, but not get wrong return code yet
you need to run a stress test to reproduce such issues.
don't do it now, as we are making it correct first. do it later.
a1. reproduce the problem
a1.1 make a stress test for insert
a1.2 run the stress test to reproduce the problem
a2. fix the problem.
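Steps a1.1 and a1.2 could be sketched roughly as below: run the insert test many times and tally the exit statuses. This assumes the test script exits non-zero on failure (the `make` errors earlier in the thread suggest it does); `run_script`, `stress` and `failure_rate` are hypothetical names, and the iteration count is arbitrary:

```python
import subprocess
import sys
from collections import Counter

def run_script(cmd):
    """Run one test command and return its exit status."""
    return subprocess.run(cmd, stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode

def stress(run_once, iterations):
    """Call run_once() repeatedly and tally the statuses it returns."""
    return Counter(run_once() for _ in range(iterations))

def failure_rate(tally):
    """Fraction of runs whose status was not 0."""
    total = sum(tally.values())
    return 0.0 if total == 0 else 1.0 - tally.get(0, 0) / total

if __name__ == "__main__":
    # Defaults to the script from this thread; pass another command to override.
    cmd = sys.argv[1:] or ["./test_insert_zeng.sh"]
    tally = stress(lambda: run_script(cmd), 200)
    print(dict(tally), "failure rate: %.3f" % failure_rate(tally))
```

Keeping `stress` independent of how one run is executed makes it easy to swap in a direct request to the service later, or to run several workers in parallel if sequential runs do not reproduce the 416.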
zhihao
08-13-2021, 11:42 AM
make test failed
Code:
[root@limbo305-1 test]# make test cases=delete_testcase
Testing delete_testcase
run test ./test_delete_testcase.sh
sending delete request /thinker/local/soft/bibo/app/test/processed/del_test_case.json to localhost:62818
delete FAILED
make: *** [Makefile:22: test] Error 255
[root@limbo305-1 test]#
I found a utilib error: it can't find the file "/thinker/etc/soft/sites/limbo305-tc/bibo.pcf", and the error path itself then fails with a NameError:
Code:
[root@limbo305-2 ~]# su sage
[sage@limbo305-2 root]$ /thinker/local/soft/bibo/plug/procone2.py --qid 0 --tycano 19
Traceback (most recent call last):
File "/thinker/local/soft/bibo/plug/procone2.py", line 23, in <module>
rb, gmic = worksite.learnSiteMic(modname="bibo")
File "/thinker/local/forest/util/utilib/worksite.py", line 88, in learnSiteMic
print >> sys.stderr, "Unable to locate configuration file " + micpfn
NameError: global name 'micpfn' is not defined
[sage@limbo305-2 root]$
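The NameError hides the real problem (the missing bibo.pcf), because the error message refers to `micpfn` before it has been assigned. The body of `learnSiteMic` is not shown in this thread, so the following is only a sketch of the general fix: bind the path before any branch that reports it. `locate_site_config` and its parameters are hypothetical, the directory layout is guessed from the path quoted above, and the syntax is Python 3 whereas the traceback comes from Python 2 code:

```python
import os
import sys

def locate_site_config(site_root, site, modname):
    """Return the path of <site_root>/<site>/<modname>.pcf, or None if absent."""
    # Build the path first, so the error message below can always use it.
    micpfn = os.path.join(site_root, site, modname + ".pcf")
    if not os.path.isfile(micpfn):
        print("Unable to locate configuration file " + micpfn, file=sys.stderr)
        return None
    return micpfn

if __name__ == "__main__":
    # Values taken from the path in the post above.
    print(locate_site_config("/thinker/etc/soft/sites", "limbo305-tc", "bibo"))
```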