Pitfalls you may hit when copying a map in Go

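Yesterday afternoon I ran into a problem. Service A opens a batch of TCP connections to service B and periodically sends an RPC request (rpc.Ping()) to check the health of each connection. Service A started reporting this error in Kibana: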

use of closed network connection

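My first guess was that service B had restarted and dropped the connections, and that this happened to fall inside the window between A's failed health check and its reconnect. An ops colleague pulled the logs for me, and they showed this error: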

{"log":"fatal error: concurrent map iteration and map write\n","stream":"stderr","time":"2019-11-01T02:52:52.771394863Z"}
{"log":"[11/01/19 10:52:52] [DEBG] RouterRPC Put Request =\u003e \u0026{6076362058 1 200178629686 2fadfb7c24bb440cb9dfcb74f515baef server:1:currt:1572229900380812731:498081}\n","stream":"stdout","time":"2019-11-01T02:52:52.771452423Z"}
{"log":"\n","stream":"stderr","time":"2019-11-01T02:52:52.773643972Z"}
{"log":"goroutine 62906840 [running]:\n","stream":"stderr","time":"2019-11-01T02:52:52.773666216Z"}
{"log":"runtime.throw(0x932fd7, 0x26)\n","stream":"stderr","time":"2019-11-01T02:52:52.773672033Z"}
{"log":"\u0009/app/go1.9/go/src/runtime/panic.go:605 +0x95 fp=0xc4202019a8 sp=0xc420201988 pc=0x42d1a5\n","stream":"stderr","time":"2019-11-01T02:52:52.773676889Z"}
{"log":"runtime.mapiternext(0xc420201ac8)\n","stream":"stderr","time":"2019-11-01T02:52:52.773682159Z"}
{"log":"\u0009/app/go1.9/go/src/runtime/hashmap.go:778 +0x6f1 fp=0xc420201a40 sp=0xc4202019a8 pc=0x40b7c1\n","stream":"stderr","time":"2019-11-01T02:52:52.773686889Z"}
{"log":"main.(*RouterRPC).AllRoomCount(0xc4201988c0, 0xc420b08630, 0xc420192158, 0x0, 0x0)\n","stream":"stderr","time":"2019-11-01T02:52:52.773694133Z"}
{"log":"\u0009/path/to/project/rpc.go:185 +0x161 fp=0xc420201b38 sp=0xc420201a40 pc=0x809421\n","stream":"stderr","time":"2019-11-01T02:52:52.773704869Z"}
{"log":"runtime.call64(0xc420187f20, 0xc420192078, 0xc420358b10, 0x1800000028)\n","stream":"stderr","time":"2019-11-01T02:52:52.773713993Z"}
{"log":"\u0009/app/go1.9/go/src/runtime/asm_amd64.s:510 +0x3b fp=0xc420201b88 sp=0xc420201b38 pc=0x45b86b\n","stream":"stderr",
...

The log shows that a map was being iterated while another goroutine wrote to it. Here is the line of code in question:

rc := b.Demo(111, "foo")
// /path/to/project/rpc.go:185
for rid, count := range rc {
    rep.Counter[rid] += count
}

rc has type map[string]int32. Access to it is not locked, because no concurrent writes were expected. Here is where it comes from:

func (b *B) Demo(s int32, a string) (rc map[string]int32) {
	b.bLock.RLock()
	rc = make(map[string]int32, len(b.rc))
	// BUG: this discards the map made above and makes rc refer to the shared inner map.
	rc = b.rc[s][a]
	b.bLock.RUnlock()
	return
}

b.rc is a map that is read and written concurrently, so reads of it must hold the lock. The problem lies in this assignment:

rc = b.rc[s][a]

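A map value is essentially a *hmap, that is, a pointer to the runtime's hmap struct, so the variable rc and b.rc[s][a] hold the same address, pointing to the same underlying hmap. After Demo() returns, the caller reads through that address without holding the lock, while other goroutines can still write to the same map, and that is what triggers the fatal error. The code was meant to copy the map but went about it the wrong way. The correct approach is to loop over the map and assign each element to a new map. For example: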

func (b *B) Demo(s int32, a string) (rc map[string]int32) {
	b.bLock.RLock()
	m := b.rc[s][a]
	// Size the new map from the source map; rc is still nil at this point.
	rc = make(map[string]int32, len(m))
	for rid, count := range m {
		// Copy element by element, skipping non-positive counts.
		if count > 0 {
			rc[rid] = count
		}
	}
	b.bLock.RUnlock()
	return
}
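
To try the copy-under-lock approach end to end, here is a minimal self-contained sketch. Store, Snapshot and Add are hypothetical names standing in for the real B type and its methods, which are not shown in full here; the nested map layout is assumed from the b.rc[s][a] expression.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a hypothetical stand-in for the article's type B.
type Store struct {
	mu sync.RWMutex
	rc map[int32]map[string]map[string]int32
}

// Snapshot copies the innermost map while holding the read lock,
// so the caller gets a map that no other goroutine can reach.
func (s *Store) Snapshot(server int32, app string) map[string]int32 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	src := s.rc[server][app] // reading a missing key of a nil map yields nil
	dst := make(map[string]int32, len(src))
	for k, v := range src {
		dst[k] = v
	}
	return dst
}

// Add increments a counter under the write lock, creating maps as needed.
func (s *Store) Add(server int32, app, room string, n int32) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.rc == nil {
		s.rc = make(map[int32]map[string]map[string]int32)
	}
	if s.rc[server] == nil {
		s.rc[server] = make(map[string]map[string]int32)
	}
	if s.rc[server][app] == nil {
		s.rc[server][app] = make(map[string]int32)
	}
	s.rc[server][app][room] += n
}

func main() {
	s := &Store{}
	s.Add(111, "foo", "room1", 1)

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // concurrent writer
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			s.Add(111, "foo", fmt.Sprintf("room%d", i), 1)
		}
	}()

	// Iterating a snapshot needs no lock, even while the writer runs.
	var total int32
	for i := 0; i < 100; i++ {
		for _, count := range s.Snapshot(111, "foo") {
			total += count
		}
	}
	wg.Wait()
	fmt.Println("sum over all snapshots:", total)
	fmt.Println("rooms:", len(s.Snapshot(111, "foo")))
}
```

If Snapshot returned s.rc[server][app] directly, the range loop in main would be iterating the shared map while the writer goroutine modifies it, which is the situation from the stack trace. Running the sketch with go run -race is one way to check either variant.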

The way the go-micro microservice framework copies header information is a useful reference.
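
As an illustration of what such a header copy can look like, here is a small sketch. It is not go-micro's code: Metadata and Copy are hypothetical names chosen for this example, and the header is assumed to be a flat map of strings.

```go
package main

import "fmt"

// Metadata is a hypothetical header type: a flat set of key/value pairs.
type Metadata map[string]string

// Copy returns a new Metadata holding the same entries as md.
// Ranging over a nil map is allowed, so a nil input yields an empty map.
func Copy(md Metadata) Metadata {
	out := make(Metadata, len(md))
	for k, v := range md {
		out[k] = v
	}
	return out
}

func main() {
	headers := Metadata{"X-Request-Id": "abc"}
	copied := Copy(headers)
	copied["X-Request-Id"] = "changed"
	fmt.Println(headers["X-Request-Id"]) // prints "abc": the original is untouched
}
```

This is a shallow copy: it is enough for string or integer values, but if the values were themselves maps, slices or pointers, the copy would still share them with the original. On Go 1.21 and later, the standard library's maps.Clone performs the same kind of shallow copy.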